Atherosclerotic Cardiovascular Disease (ASCVD), which encompasses coronary heart disease, cerebrovascular disease, and peripheral arterial disease, is a leading cause of morbidity and mortality worldwide. Current ASCVD risk assessment tools, while useful, have limitations and may not adequately consider the complex, multifactorial nature of the disease. As such, there is a growing need for more sophisticated predictive models that integrate a wider array of clinical and demographic variables to identify individuals at risk for ASCVD more accurately and earlier in the disease process.
Our goal is to predict 10-year ASCVD risk in adults using key features such as age, gender, race, smoking status, diabetes, hypertension, and cholesterol levels. The dataset aims to facilitate accurate risk assessments and guide targeted preventive healthcare interventions.
The objectives of the project include:
Employing decision tree learning algorithms that can uncover intricate patterns and interactions among diverse risk factors.
Enhancing the precision and personalization of ASCVD risk prediction beyond what is possible with conventional risk assessment tools.
To employ an advanced predictive model for ASCVD risk assessment decision tree learning and clustering algorithms that can identify complex patterns in our dataset, capturing interactions among a multitude of risk factors for ASCVD.
It consist of 1000 row that each have 10 attributes.
“Risk”; 10-year risk for ASCVD which is categorized as:
Low-risk (<5%)
Borderline risk (5% to 7.4%)
Intermediate risk (7.5% to 19.9%)
High risk (≥20%)
## Attribute_Name Description Data_Type
## 1 isMale Gender Binary
## 2 isBlack Race Binary
## 3 isSmoker Smoking Status Binary
## 4 isDiabetic Diabetes Status Binary
## 5 isHypertensive Hypertension Status Binary
## 6 Age Age of the candidate Numeric (Integer)
## 7 Systolic Max Blood Pressure Numeric (Integer)
## 8 Cholesterol Total Cholesterol Numeric (Integer)
## 9 HDL HDL Cholesterol Numeric (Integer)
## 10 Risk (class label) 10-year ASCVD Risk Numeric (Decimal)
## Possible_Values
## 1 0 (Female), 1 (Male)
## 2 0 (Not Black), 1 (Black)
## 3 0 (Non-smoker), 1 (Smoker)
## 4 0 (Normal), 1 (Diabetic)
## 5 0 (Normal BP), 1 (High BP)
## 6 Range between 40-79
## 7 Range between 90-200
## 8 Range between 130-200
## 9 Range between 20-100
## 10 Low, Borderline, Intermediate, High risk
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
dataset <- read.csv("heartRisk.csv")
head(dataset)
str(dataset)
## 'data.frame': 1000 obs. of 10 variables:
## $ isMale : int 1 0 0 1 0 0 1 1 0 1 ...
## $ isBlack : int 1 0 1 1 0 0 0 0 0 0 ...
## $ isSmoker : int 0 0 1 1 1 1 1 1 1 0 ...
## $ isDiabetic : int 1 1 1 1 0 0 0 1 0 1 ...
## $ isHypertensive: int 1 1 1 0 1 1 0 0 1 1 ...
## $ Age : int 49 69 50 42 66 52 40 75 42 65 ...
## $ Systolic : int 101 167 181 145 134 154 104 136 169 196 ...
## $ Cholesterol : int 181 155 147 166 199 174 187 189 179 187 ...
## $ HDL : int 32 59 59 46 63 22 52 59 99 46 ...
## $ Risk : num 11.1 30.1 37.6 13.2 15.1 17.3 2.1 46 1.7 48.5 ...
dim(dataset)
## [1] 1000 10
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
describe(dataset)
## dataset
##
## 10 Variables 1000 Observations
## --------------------------------------------------------------------------------
## isMale
## n missing distinct Info Sum Mean Gmd
## 1000 0 2 0.75 490 0.49 0.5003
##
## --------------------------------------------------------------------------------
## isBlack
## n missing distinct Info Sum Mean Gmd
## 1000 0 2 0.747 530 0.53 0.4987
##
## --------------------------------------------------------------------------------
## isSmoker
## n missing distinct Info Sum Mean Gmd
## 1000 0 2 0.749 516 0.516 0.5
##
## --------------------------------------------------------------------------------
## isDiabetic
## n missing distinct Info Sum Mean Gmd
## 1000 0 2 0.749 522 0.522 0.4995
##
## --------------------------------------------------------------------------------
## isHypertensive
## n missing distinct Info Sum Mean Gmd
## 1000 0 2 0.75 495 0.495 0.5005
##
## --------------------------------------------------------------------------------
## Age
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 40 0.999 59.11 13.32 42 43
## .25 .50 .75 .90 .95
## 49 59 69 75 77
##
## lowest : 40 41 42 43 44, highest: 75 76 77 78 79
## --------------------------------------------------------------------------------
## Systolic
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 111 1 144.2 36.69 95 102
## .25 .50 .75 .90 .95
## 117 144 171 189 194
##
## lowest : 90 91 92 93 94, highest: 196 197 198 199 200
## --------------------------------------------------------------------------------
## Cholesterol
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 71 1 164 23.48 133 136
## .25 .50 .75 .90 .95
## 146 164 182 192 196
##
## lowest : 130 131 132 133 134, highest: 196 197 198 199 200
## --------------------------------------------------------------------------------
## HDL
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 81 1 59.6 27.56 23 27
## .25 .50 .75 .90 .95
## 39 59 81 93 97
##
## lowest : 20 21 22 23 24, highest: 96 97 98 99 100
## --------------------------------------------------------------------------------
## Risk
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 439 1 19.67 18.37 1.20 2.20
## .25 .50 .75 .90 .95
## 6.30 14.40 29.00 45.13 55.30
##
## lowest : 0.1 0.2 0.3 0.4 0.5 , highest: 76.5 76.8 78.1 78.5 85.4
## --------------------------------------------------------------------------------
summary(dataset)
## isMale isBlack isSmoker isDiabetic isHypertensive
## Min. :0.00 Min. :0.00 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.00 Median :1.00 Median :1.000 Median :1.000 Median :0.000
## Mean :0.49 Mean :0.53 Mean :0.516 Mean :0.522 Mean :0.495
## 3rd Qu.:1.00 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :1.00 Max. :1.00 Max. :1.000 Max. :1.000 Max. :1.000
## Age Systolic Cholesterol HDL Risk
## Min. :40.00 Min. : 90.0 Min. :130 Min. : 20.0 Min. : 0.10
## 1st Qu.:49.00 1st Qu.:117.0 1st Qu.:146 1st Qu.: 39.0 1st Qu.: 6.30
## Median :59.00 Median :144.0 Median :164 Median : 59.0 Median :14.40
## Mean :59.11 Mean :144.2 Mean :164 Mean : 59.6 Mean :19.67
## 3rd Qu.:69.00 3rd Qu.:171.0 3rd Qu.:182 3rd Qu.: 81.0 3rd Qu.:29.00
## Max. :79.00 Max. :200.0 Max. :200 Max. :100.0 Max. :85.40
var(dataset$Age)
## [1] 133.0906
var(dataset$Systolic)
## [1] 1009.621
var(dataset$Cholesterol)
## [1] 413.3045
var(dataset$HDL)
## [1] 569.4669
var(dataset$Risk)
## [1] 290.4959
All the attributes’ variance results are higher than their mean values, which implies that the dataset has greater variability and is more heterogeneous. This might indicate that the values in our dataset are more scattered; have a wider range of values, potentially suggesting a more diverse or varied pattern in the data.
library(ggplot2)
ggplot(dataset, aes(x = Age, y =Systolic, color= 'red'))+
geom_point() +
xlab("Age") +
ylab("Blood Pressure")
In order to gain a deeper understanding of our dataset, we examined the attributes “Systolic” and “Age” to determine if there was a predictive or correlational relationship between them. However, after analyzing the scatter plot, we discovered that there is no discernible relationship or correlation between these two attributes.
ggplot(dataset, aes(x = Systolic, y = Risk)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, aes(color = "Regression Line")) +
facet_wrap(~cut(Age, 3), scales = "free") +
xlab("Systolic Blood Pressure") +
ylab("Risk") +
ggtitle("Relationship between Systolic Blood Pressure and Risk at Different Age Levels") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
However, notable association between ‘Systolic Blood Pressure’, ‘Age’, and ‘Risk’, segmented into various age categories. It shows that risk notably rises with age and Blood Pressure,as the regression line for the age bracket (66,79] exhibits higher risks, indicating a high correlation between advancing age and elevated risk levels in this dataset.
library(tidyr)
dataset_long <- gather(dataset, key = "column", value = "value", Age:ncol(dataset))
ggplot(dataset_long, aes(x = value, fill = column)) +
geom_density(alpha = 0.7) +
facet_wrap(~column, scales = "free") +
xlab("Value") +
ylab("Density")
To understand the relative frequency of different values within our dataest we measeured the density, and analyzed the corresponding graphs. Here are the observations we made:
- The graph representing the distribution of ages shows a reasonable representation of ages between 40 and 80 within the dataset. This suggests that the age values are well-distributed within this range.
- Both the density graphs for cholesterol and HDL indicate a slight skew towards lower cholesterol levels. This suggests that the majority of the data points tend to have lower cholesterol values rather than higher ones.
- The density graph for systolic blood pressure displays a uniform distribution across the entire range of blood pressures. This indicates that the data points are evenly spread out without any significant concentration in specific pressure ranges.
- The density graph for the risk variable exhibits a positively skewed (right-skewed) distribution. This implies that there is a higher frequency of data points with lower risk values, while the occurrence of higher risk values is relatively less frequent.
bb <- dataset$isSmoker %>% table() %>%
barplot(bb , col = c("lightgreen","darkred"), width= c(4,4.1),space=0.1, names.arg=c("o","1"), legend.text = c("Non-Smoker","Smoker"))
To better understand the smoking status within our dataset, we visualized the data using a bar plot. This visualization was chosen to provide a clear and easily interpretable representation of the differences in smoking status. From the bar plot, we observed that the numbers are nearly evenly distributed between non-smokers (0) and smokers (1). This indicates that there is a balanced representation of individuals who are non-smokers and smokers in the dataset.
library(corrplot)
## corrplot 0.92 loaded
corr_matrix <- cor(dataset)
corrplot(corr_matrix, method = "color", type = "lower", tl.col = "black", tl.srt = 45,
addCoef.col = "black", number.cex = 0.7, tl.cex = 0.7, col = colorRampPalette(c("white", "lightblue"))(90))
## Warning in ind1:ind2: numerical expression has 2 elements: only the first used
By analyzing the correlation matrix of our dataset, we can identify suspicious events and patterns in the data. However, it is evident that there are no strong correlations among the features in the dataset. Despite this, we can rank the correlations in descending order based on their impact on the risk of heart disease.The order of correlations, from highest to lowest in terms of their influence on heart disease risk, is as follows: Age, Systolic blood pressure, is Diabetic, is Smoker, is Hypertensive, gender is male, , race is black, Cholestrol, HDL.
boxplot(dataset$Age)
The Age boxplot shows a wide range of values that might lead to a lower accuracy of the results when it comes to clculations so we need change it to a standardized range. Additionally, the boxplot analysis indicates that there are no outliers present in the Age attribute. This implies that the Age data points are within a reasonable range and do not deviate significantly from the overall distribution of values.
boxplot(dataset$Systolic)
The boxplot analysis of the Systolic blood pressure attribute reveals the absence of outliers, indicating that the data points are within a reasonable range without any extreme values. However, it is worth noting that the range of Systolic blood pressure is considerably large. To ensure accurate calculations and mitigate potential conflicts, it is recommended to transform the Systolic blood pressure into a smaller and standardized range. This transformation will help normalize the data and make it more suitable for analysis and calculations.
boxplot(dataset$Cholesterol)
According to the boxplot analysis of the Cholesterol attribute, no outliers are observed, suggesting that the data points are within a reasonable range without any extreme values. However, it is important to narrow down the range of values to optimize the accuracy of our calculations. By reducing the range of Cholesterol values, we can improve the reliability and precision of our dataset, enabling us to obtain more reliable and meaningful results.
boxplot(dataset$HDL)
The HDL boxplot reveal that there are no outlires shown. However, it is necessary to transform the range of HDL values to bring them into a standardized and common range. By performing this transformation, we can almost ensure to have better insights and improved data quality.
Since missing/null values can affect the dataset badly we decided to check it and delete all missing/null values from our dataset to make it as clean as possible so that we can end up with efficint dataset resulting to a higher possibiliaty of accurete results later on.
# Check for missing values
missing_values <- colSums(is.na(dataset))
# Print columns with missing values
print("Columns with missing values:")
## [1] "Columns with missing values:"
print(names(missing_values)[missing_values > 0])
## character(0)
# Print the count of missing values for each column
print("Count of missing values for each column:")
## [1] "Count of missing values for each column:"
print(missing_values)
## isMale isBlack isSmoker isDiabetic isHypertensive
## 0 0 0 0 0
## Age Systolic Cholesterol HDL Risk
## 0 0 0 0 0
In data analysis, checking and removing outliers is crucial to ensure the reliability of statistical insights. Outliers, as extreme data points, can distort summary statistics, potentially leading to inaccurate analyses. By identifying and, if necessary, removing outliers, we enhance the robustness of our findings.
# Compute IRQ
Q1 <- quantile(dataset$Age, 0.25)
Q3 <- quantile(dataset$Age, 0.75)
IQR <- Q3 - Q1
# Identify outliers
lower_bound <- Q1 - (1.5 * IQR)
upper_bound <- Q3 + (1.5 * IQR)
outliers <- which(dataset$Age < lower_bound | dataset$Age > upper_bound)
# Get the number of outliers
num_outliers <- length(outliers)
print(paste("Number of Age outliers:", num_outliers))
## [1] "Number of Age outliers: 0"
# Compute IRQ
Q1 <- quantile(dataset$Systolic, 0.25)
Q3 <- quantile(dataset$Systolic, 0.75)
IQR <- Q3 - Q1
# Identify outliers
lower_bound <- Q1 - (1.5 * IQR)
upper_bound <- Q3 + (1.5 * IQR)
outliers <- which(dataset$Systolic < lower_bound | dataset$Systolic > upper_bound)
# Get the number of outliers
num_outliers <- length(outliers)
print(paste("Number of Systolic outliers:", num_outliers))
## [1] "Number of Systolic outliers: 0"
# Compute IRQ
Q1 <- quantile(dataset$Cholesterol, 0.25)
Q3 <- quantile(dataset$Cholesterol, 0.75)
IQR <- Q3 - Q1
# Identify outliers
lower_bound <- Q1 - (1.5 * IQR)
upper_bound <- Q3 + (1.5 * IQR)
outliers <- which(dataset$Cholesterol < lower_bound | dataset$Cholesterol > upper_bound)
# Get the number of outliers
num_outliers <- length(outliers)
print(paste("Number of Cholesterol outliers:", num_outliers))
## [1] "Number of Cholesterol outliers: 0"
# Compute IRQ
Q1 <- quantile(dataset$HDL, 0.25)
Q3 <- quantile(dataset$HDL, 0.75)
IQR <- Q3 - Q1
# Identify outliers
lower_bound <- Q1 - (1.5 * IQR)
upper_bound <- Q3 + (1.5 * IQR)
outliers <- which(dataset$HDL < lower_bound | dataset$HDL > upper_bound)
# Get the number of outliers
num_outliers <- length(outliers)
print(paste("Number of HDL outliers:", num_outliers))
## [1] "Number of HDL outliers: 0"
The result indicates that there are no outliers, but we will also use a box plot to ensure that there are no outliers.
boxplot(dataset[,c(6,7,8,9)], main="Boxplot with Outliers", col=c("lightblue","lightblue","lightblue","lightblue"))
By using the box plot we can see that there are no outliers in the data set.
In analyzing the dataset,The initial dataset provided a comprehensive and relevant set of information for the research objectives without the need for removal or condensation of variables.
used the findCorrelation function in caret library to outputs the index of variables we need to delete. targeting any pair with a correlation coefficient exceeding 0.75.
findCorrelation(cor(dataset), cutoff=0.75)
## integer(0)
In our case, the function finds out that no feature need to be deleted.
Data normalization is a preprocessing step that involves transforming numerical data within a dataset to a standard, uniform scale. This process ensures that all variables, regardless of their original units or scales, are brought into a consistent and comparable range. the following attributes were selected for normalization:(age, systolic, cholestrol, HDL)
normalize <- function(x)
{
return ((x - min(x))/ (max(x)- min(x)) )
}
dataset$Age<-normalize(dataset$Age)
dataset$Systolic<-normalize(dataset$Systolic)
dataset$Cholesterol<-normalize(dataset$Cholesterol)
dataset$HDL<-normalize(dataset$HDL)
head(dataset)
we have successfully completed the data normalization. This process entailed scaling our numerical features to a standardized range, typically between 0 and 1.
To make our dataset understandable and easily interpretable, especially when using tree-based classification methods, we transformed the continuous class label ‘Risk’ into specific, categorized risk levels.
These levels are delineated as:
Low risk (<5%), Borderline risk (5% to 7.4%), Intermediate risk (7.5% to 19.9%), and High risk (≥20%).
# Categorize 'Risk' into defined categories
dataset$Risk <- cut(
dataset$Risk,
breaks = c(-Inf, 5, 7.4, 19.9, Inf),
labels = c("Low risk", "Borderline risk", "Intermediate risk", "High risk"),
right = FALSE,
include.lowest = TRUE
)
our dataset after Discretization:
head(dataset)
Feature selection is one of the most important task to boost performance of our machine learning model by removing irrelevant features the model will make decisions only using important features. we will use Recursive Feature Elimination (RFE), which is a widely used wrapper-type algorithm for selecting features that are most relevant in predicting the target variable ‘Risk’ in our case.
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.22-2
# ensure results are repeatable
set.seed(7)
# Define RFE control parameters
ctrl <- rfeControl(functions=rfFuncs, method="cv", number=10)
# Execute RFE using dataset features 1-9 and "Risk" as the class lable
results <- rfe(dataset[,1:9], dataset$Risk, sizes=c(1:9), rfeControl=ctrl)
# Display RFE results
print(results)
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 1 0.5832 0.3887 0.03859 0.05413
## 2 0.5489 0.3401 0.03516 0.05332
## 3 0.6230 0.4335 0.03123 0.04525
## 4 0.6671 0.5073 0.04478 0.06397
## 5 0.6770 0.5222 0.02512 0.03598
## 6 0.7132 0.5739 0.03336 0.05041
## 7 0.7821 0.6764 0.03986 0.05887
## 8 0.7812 0.6748 0.03076 0.04539
## 9 0.8009 0.7051 0.02630 0.03865 *
##
## The top 5 variables (out of 9):
## Age, Systolic, isDiabetic, isSmoker, isMale
plot(results, type=c("g", "o"))
The asterisk (*) in the column indicates the number of features recommended by RFE as yielding the best model according to the resampling results. it shows that when 9 variables are used, the model achieves the best accuracy of approximately 80% and a kappa value of 0.7.
The graphical representation of feature importance :
The “Mean Decrease Gini” score tells us how crucial a feature is for making accurate predictions in a Random Forest model. A higher score means the feature is more valuable in deciding how to classify the data correctly, helping the model make better decisions.
# Setting seed for reproducibility
set.seed(123)
# Fit a random forest model
rf_model <- randomForest(Risk ~ ., data = dataset)
var_imp <- importance(rf_model)
var_imp_df <- data.frame(variables = row.names(var_imp), var_imp)
# Sorting variables based on importance
var_imp_df <- var_imp_df[order(var_imp_df$MeanDecreaseGini, decreasing = TRUE),]
# Plotting variable importance using ggplot2
ggplot(var_imp_df, aes(x = reorder(variables, MeanDecreaseGini), y = MeanDecreaseGini)) +geom_col() +
coord_flip() +
labs(title = "Feature Importance",
x = "Features",
y = "Importance (Mean Decrease in Gini)")
The graph shows that ‘Age’ and ‘Systolic’ are key variables influencing our model’s predictions of ‘Risk’. while variables like isHypertensive, isBlack were found to have the least impact on the model’s predictive capability.
Overall, we think it’s a good practice to make use of all our features as recommended by RFE, particularly when we are dealing with a modest number, to avoid potential overfitting.we
Balancing data is crucial for improving the performance and fairness of machine learning models. When data are imbalanced, with one class significantly outnumbering the others, models tend to bias towards the majority class, leading to poor predictive accuracy for minority classes.
# Calculate class distribution
class_distribution <- table(dataset$Risk)
# Create a bar plot
barplot(class_distribution,
main = "Class Distribution for Risk",
xlab = "Risk Level",
ylab = "Count",
names.arg = levels(dataset$Risk))
library(ROSE)
## Loaded ROSE 0.0-4
balanced_data <- upSample(dataset[, 1:9], dataset$Risk, yname = "Risk")
# Plot the distribution of the "Risk" classes
plot(balanced_data$Risk)
# Check the proportion and count of "Risk" classes
prop_table <- prop.table(table(balanced_data$Risk))
count_table <- table(balanced_data$Risk)
After balancing our data, the model becomes more capable of providing accurate predictions, ensuring a fair evaluation of their performance.
Classification analysis is a fundamental aspect of machine learning, focusing on categorizing data into distinct classes. In our study, we aim to build predictive models that efficiently assign predefined labels to new instances based on their features. To enhance the robustness of our models, we have divided the dataset into three sets: training, validation, and testing. By employing different proportions of training data—60%, 70%, and 80%—we seek to evaluate and compare the models’ performances. This approach ensures a comprehensive understanding of model behavior under varying training scenarios, guiding us to select the most effective model for our specific dataset.
Gain ratio is a metric that assesses the quality of a split within decision tree algorithms. to evaluate the quality of a split based on the information gain and the intrinsic information of a feature.we have implemented the Gain Ratio (C4.5) algorithm and the J48 function from the RWeka package. This algorithm partitions our data into training and testing sets, builds a J48 decision tree on the training data,
1-partition the data into ( 60% training, 40% testing):
# Load the RWeka package
library(RWeka)
set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.60 , 0.40))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
# Define the formula
myFormula <- Risk ~ .
# Build the J48 decision tree on the training data
C45Fit <- J48(myFormula, data = trainData)
# Create a table to compare predicted vs. actual values on the training data
table(predict(C45Fit), trainData$Risk)
##
## Low risk Borderline risk Intermediate risk High risk
## Low risk 240 1 3 1
## Borderline risk 6 217 5 1
## Intermediate risk 0 0 225 13
## High risk 0 3 17 227
# Print a summary of the J48 model
print(C45Fit)
## J48 pruned tree
## ------------------
##
## Age <= 0.564103
## | HDL <= 0.225
## | | Systolic <= 0.545455
## | | | isHypertensive <= 0
## | | | | Age <= 0.025641: Low risk (6.0)
## | | | | Age > 0.025641
## | | | | | HDL <= 0.0125: Intermediate risk (6.0)
## | | | | | HDL > 0.0125
## | | | | | | Cholesterol <= 0.2
## | | | | | | | Systolic <= 0.290909: Low risk (3.0)
## | | | | | | | Systolic > 0.290909: Intermediate risk (3.0)
## | | | | | | Cholesterol > 0.2
## | | | | | | | Age <= 0.128205
## | | | | | | | | isBlack <= 0: Borderline risk (2.0)
## | | | | | | | | isBlack > 0
## | | | | | | | | | Age <= 0.051282: Intermediate risk (2.0)
## | | | | | | | | | Age > 0.051282: Low risk (2.0)
## | | | | | | | Age > 0.128205
## | | | | | | | | Age <= 0.435897: Borderline risk (18.0)
## | | | | | | | | Age > 0.435897
## | | | | | | | | | isDiabetic <= 0
## | | | | | | | | | | Age <= 0.461538: Borderline risk (3.0)
## | | | | | | | | | | Age > 0.461538
## | | | | | | | | | | | Systolic <= 0.190909: Intermediate risk (2.0)
## | | | | | | | | | | | Systolic > 0.190909: Borderline risk (3.0)
## | | | | | | | | | isDiabetic > 0: Intermediate risk (2.0)
## | | | isHypertensive > 0
## | | | | isDiabetic <= 0
## | | | | | Systolic <= 0.309091
## | | | | | | Systolic <= 0.190909
## | | | | | | | isMale <= 0: Low risk (6.0)
## | | | | | | | isMale > 0: Intermediate risk (3.0)
## | | | | | | Systolic > 0.190909: Borderline risk (5.0/1.0)
## | | | | | Systolic > 0.309091: Intermediate risk (10.0)
## | | | | isDiabetic > 0
## | | | | | isMale <= 0: Intermediate risk (11.0/1.0)
## | | | | | isMale > 0
## | | | | | | isBlack <= 0: High risk (3.0/1.0)
## | | | | | | isBlack > 0: Intermediate risk (4.0/1.0)
## | | Systolic > 0.545455
## | | | Cholesterol <= 0.014286: Borderline risk (5.0/1.0)
## | | | Cholesterol > 0.014286
## | | | | isSmoker <= 0
## | | | | | Systolic <= 0.681818: High risk (3.0)
## | | | | | Systolic > 0.681818
## | | | | | | Age <= 0.461538: Intermediate risk (9.0)
## | | | | | | Age > 0.461538: High risk (3.0/1.0)
## | | | | isSmoker > 0: High risk (29.0/6.0)
## | HDL > 0.225
## | | Age <= 0.282051
## | | | isBlack <= 0
## | | | | Cholesterol <= 0.557143
## | | | | | Systolic <= 0.718182: Low risk (78.0)
## | | | | | Systolic > 0.718182
## | | | | | | isDiabetic <= 0: Low risk (18.0/1.0)
## | | | | | | isDiabetic > 0
## | | | | | | | HDL <= 0.6375
## | | | | | | | | Systolic <= 0.909091: Low risk (5.0)
## | | | | | | | | Systolic > 0.909091: Intermediate risk (2.0)
## | | | | | | | HDL > 0.6375: Borderline risk (11.0)
## | | | | Cholesterol > 0.557143
## | | | | | Systolic <= 0.163636: Low risk (5.0)
## | | | | | Systolic > 0.163636
## | | | | | | isSmoker <= 0
## | | | | | | | Age <= 0.230769: Low risk (15.0)
## | | | | | | | Age > 0.230769: Borderline risk (8.0/1.0)
## | | | | | | isSmoker > 0
## | | | | | | | HDL <= 0.7375: Borderline risk (33.0/4.0)
## | | | | | | | HDL > 0.7375: Low risk (8.0/1.0)
## | | | isBlack > 0
## | | | | Systolic <= 0.536364
## | | | | | isMale <= 0
## | | | | | | Cholesterol <= 0.828571: Low risk (30.0/1.0)
## | | | | | | Cholesterol > 0.828571: Borderline risk (2.0)
## | | | | | isMale > 0
## | | | | | | isDiabetic <= 0
## | | | | | | | isSmoker <= 0
## | | | | | | | | isHypertensive <= 0: Low risk (9.0)
## | | | | | | | | isHypertensive > 0: Borderline risk (6.0/1.0)
## | | | | | | | isSmoker > 0
## | | | | | | | | Age <= 0.179487: Borderline risk (12.0/1.0)
## | | | | | | | | Age > 0.179487: Intermediate risk (2.0)
## | | | | | | isDiabetic > 0
## | | | | | | | Systolic <= 0.072727: Low risk (4.0)
## | | | | | | | Systolic > 0.072727: Intermediate risk (9.0/1.0)
## | | | | Systolic > 0.536364
## | | | | | isHypertensive <= 0
## | | | | | | Age <= 0.205128
## | | | | | | | isMale <= 0
## | | | | | | | | Age <= 0.128205: Borderline risk (5.0)
## | | | | | | | | Age > 0.128205: Low risk (6.0/1.0)
## | | | | | | | isMale > 0
## | | | | | | | | Cholesterol <= 0.685714: Intermediate risk (5.0/1.0)
## | | | | | | | | Cholesterol > 0.685714: Borderline risk (8.0)
## | | | | | | Age > 0.205128
## | | | | | | | isSmoker <= 0
## | | | | | | | | Age <= 0.25641: Intermediate risk (2.0)
## | | | | | | | | Age > 0.25641: Low risk (2.0)
## | | | | | | | isSmoker > 0: Intermediate risk (4.0)
## | | | | | isHypertensive > 0
## | | | | | | Systolic <= 0.890909
## | | | | | | | Age <= 0.076923
## | | | | | | | | HDL <= 0.5625: Intermediate risk (3.0)
## | | | | | | | | HDL > 0.5625: Borderline risk (7.0)
## | | | | | | | Age > 0.076923
## | | | | | | | | Age <= 0.179487: Intermediate risk (7.0)
## | | | | | | | | Age > 0.179487
## | | | | | | | | | isDiabetic <= 0: Intermediate risk (4.0/1.0)
## | | | | | | | | | isDiabetic > 0: High risk (2.0)
## | | | | | | Systolic > 0.890909: High risk (7.0)
## | | Age > 0.282051
## | | | Systolic <= 0.7
## | | | | isDiabetic <= 0
## | | | | | isMale <= 0
## | | | | | | Age <= 0.487179
## | | | | | | | Systolic <= 0.381818: Low risk (19.0)
## | | | | | | | Systolic > 0.381818
## | | | | | | | | HDL <= 0.55: Borderline risk (3.0)
## | | | | | | | | HDL > 0.55: Low risk (10.0/1.0)
## | | | | | | Age > 0.487179
## | | | | | | | Cholesterol <= 0.3: Low risk (3.0)
## | | | | | | | Cholesterol > 0.3
## | | | | | | | | Systolic <= 0.363636: Borderline risk (13.0)
## | | | | | | | | Systolic > 0.363636
## | | | | | | | | | Cholesterol <= 0.414286: Borderline risk (2.0)
## | | | | | | | | | Cholesterol > 0.414286: Intermediate risk (2.0)
## | | | | | isMale > 0
## | | | | | | Systolic <= 0.663636
## | | | | | | | isHypertensive <= 0
## | | | | | | | | isSmoker <= 0: Low risk (5.0)
## | | | | | | | | isSmoker > 0: Intermediate risk (2.0)
## | | | | | | | isHypertensive > 0: Intermediate risk (18.0)
## | | | | | | Systolic > 0.663636: Borderline risk (7.0)
## | | | | isDiabetic > 0
## | | | | | Age <= 0.461538
## | | | | | | isSmoker <= 0
## | | | | | | | isMale <= 0
## | | | | | | | | isHypertensive <= 0
## | | | | | | | | | HDL <= 0.675: Borderline risk (8.0/1.0)
## | | | | | | | | | HDL > 0.675: Low risk (4.0)
## | | | | | | | | isHypertensive > 0
## | | | | | | | | | Systolic <= 0.290909: Low risk (2.0)
## | | | | | | | | | Systolic > 0.290909: Intermediate risk (2.0)
## | | | | | | | isMale > 0
## | | | | | | | | Systolic <= 0.072727: Borderline risk (12.0)
## | | | | | | | | Systolic > 0.072727
## | | | | | | | | | isHypertensive <= 0: Intermediate risk (5.0)
## | | | | | | | | | isHypertensive > 0: Borderline risk (4.0)
## | | | | | | isSmoker > 0
## | | | | | | | isHypertensive <= 0
## | | | | | | | | Systolic <= 0.6
## | | | | | | | | | Cholesterol <= 0.628571: Borderline risk (13.0/1.0)
## | | | | | | | | | Cholesterol > 0.628571: Intermediate risk (2.0)
## | | | | | | | | Systolic > 0.6: High risk (2.0)
## | | | | | | | isHypertensive > 0
## | | | | | | | | isMale <= 0: Intermediate risk (3.0/1.0)
## | | | | | | | | isMale > 0: High risk (2.0)
## | | | | | Age > 0.461538
## | | | | | | Cholesterol <= 0.328571: Borderline risk (2.0)
## | | | | | | Cholesterol > 0.328571: Intermediate risk (19.0/1.0)
## | | | Systolic > 0.7
## | | | | Systolic <= 0.9
## | | | | | isSmoker <= 0: Intermediate risk (12.0)
## | | | | | isSmoker > 0
## | | | | | | Age <= 0.384615: Intermediate risk (6.0)
## | | | | | | Age > 0.384615: High risk (7.0/1.0)
## | | | | Systolic > 0.9
## | | | | | isDiabetic <= 0
## | | | | | | Systolic <= 0.936364: Borderline risk (7.0)
## | | | | | | Systolic > 0.936364: Intermediate risk (5.0/1.0)
## | | | | | isDiabetic > 0: High risk (4.0)
## Age > 0.564103
## | Systolic <= 0.5
## | | isDiabetic <= 0
## | | | HDL <= 0.15
## | | | | Systolic <= 0.190909
## | | | | | isMale <= 0: Low risk (2.0)
## | | | | | isMale > 0: Intermediate risk (2.0)
## | | | | Systolic > 0.190909: High risk (9.0)
## | | | HDL > 0.15
## | | | | Systolic <= 0.427273
## | | | | | Cholesterol <= 0.7
## | | | | | | Systolic <= 0.290909
## | | | | | | | isHypertensive <= 0
## | | | | | | | | HDL <= 0.6: Intermediate risk (7.0)
## | | | | | | | | HDL > 0.6
## | | | | | | | | | Cholesterol <= 0.371429: Intermediate risk (3.0)
## | | | | | | | | | Cholesterol > 0.371429
## | | | | | | | | | | Systolic <= 0.054545: Intermediate risk (2.0)
## | | | | | | | | | | Systolic > 0.054545: Borderline risk (18.0)
## | | | | | | | isHypertensive > 0
## | | | | | | | | Systolic <= 0.172727
## | | | | | | | | | Age <= 0.692308: Low risk (3.0)
## | | | | | | | | | Age > 0.692308: Intermediate risk (3.0)
## | | | | | | | | Systolic > 0.172727: Borderline risk (7.0/1.0)
## | | | | | | Systolic > 0.290909: Intermediate risk (15.0/1.0)
## | | | | | Cholesterol > 0.7
## | | | | | | Age <= 0.897436: Intermediate risk (12.0)
## | | | | | | Age > 0.897436
## | | | | | | | Systolic <= 0.209091: Intermediate risk (3.0/1.0)
## | | | | | | | Systolic > 0.209091: High risk (3.0)
## | | | | Systolic > 0.427273
## | | | | | Systolic <= 0.472727: High risk (5.0)
## | | | | | Systolic > 0.472727: Borderline risk (5.0)
## | | isDiabetic > 0
## | | | isSmoker <= 0
## | | | | Age <= 0.923077
## | | | | | Systolic <= 0.318182: Intermediate risk (21.0/3.0)
## | | | | | Systolic > 0.318182: High risk (8.0/1.0)
## | | | | Age > 0.923077: High risk (5.0)
## | | | isSmoker > 0
## | | | | isHypertensive <= 0
## | | | | | isBlack <= 0
## | | | | | | Age <= 0.794872: Intermediate risk (4.0)
## | | | | | | Age > 0.794872: High risk (2.0)
## | | | | | isBlack > 0: High risk (3.0)
## | | | | isHypertensive > 0: High risk (22.0)
## | Systolic > 0.5: High risk (128.0/10.0)
##
## Number of Leaves : 110
##
## Size of the tree : 219
# Make predictions using the J48 model on the test data
testPred <- predict(C45Fit, newdata = testData)
# Create a confusion matrix
conf_matrix <- table(testPred, testData$Risk)
# Display the confusion matrix
print(conf_matrix)
##
## testPred Low risk Borderline risk Intermediate risk High risk
## Low risk 126 4 5 1
## Borderline risk 16 165 24 12
## Intermediate risk 6 0 87 27
## High risk 3 7 31 115
# Calculate performance metrics
accuracy_G1 <- sum((diag(conf_matrix)) / sum(conf_matrix))
error_rate_G1 <-( 1 - accuracy_G1)
sensitivity_G1 <- conf_matrix[4, 4] / sum(conf_matrix[4, ])
specificity_G1 <- sum(diag(conf_matrix[-4, -4])) / sum(conf_matrix[-4, ])
precision_G1 <- conf_matrix[4, 4] / sum(conf_matrix[, 4])
# Display performance metrics
cat("Accuracy: ", accuracy_G1, "\n")
## Accuracy: 0.7837838
cat("Error Rate: ", error_rate_G1, "\n")
## Error Rate: 0.2162162
cat("Sensitivity (Recall): ", sensitivity_G1, "\n")
## Sensitivity (Recall): 0.7371795
cat("Specificity: ", specificity_G1, "\n")
## Specificity: 0.7991543
cat("Precision: ", precision_G1, "\n")
## Precision: 0.7419355
Analysis:
- The C4.5 decision tree, employing the gain ratio criterion, showcases robust performance on our dataset with an accuracy of 78.38%. Its ability to effectively capture complex relationships is reflected in the tree’s structure, consisting of 219 nodes and 110 leaves. Notably, the model demonstrates a balanced trade-off between sensitivity (73.72%) and specificity (79.92%), indicating its proficiency in correctly identifying positive and negative instances. With a precision of 74.19%, the model reliably makes accurate positive predictions.
The decision tree’s 110 leaves and size of 219 represent the complexity and granularity with which the model classifies ASCVD risk
The root of the tree was identified by the attribute
Age, with a threshold value of
0.564103, suggesting that age is a primary
factor in determining ASCVD risk.
Individuals with Age less than or
equal to the threshold were further analyzed for
HDL cholesterol levels. An
HDL level at or below
0.225 indicated a potential for increased
risk, with further distinctions made based on
Systolic blood pressure readings.
2-partition the data into ( 70% training, 30% testing):
set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.70 , 0.30))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
# Define the formula
myFormula <- Risk ~ .
# Build the J48 decision tree on the training data
C45Fit <- J48(myFormula, data = trainData )
# Create a table to compare predicted vs. actual values on the training data
table(predict(C45Fit), trainData$Risk)
##
## Low risk Borderline risk Intermediate risk High risk
## Low risk 272 1 5 1
## Borderline risk 4 270 6 4
## Intermediate risk 5 0 265 12
## High risk 0 0 14 273
# Print a summary of the J48 model
print(C45Fit)
## J48 pruned tree
## ------------------
##
## Age <= 0.564103
## | HDL <= 0.225
## | | Systolic <= 0.545455
## | | | isDiabetic <= 0
## | | | | isSmoker <= 0
## | | | | | Age <= 0.25641: Low risk (13.0)
## | | | | | Age > 0.25641
## | | | | | | isHypertensive <= 0
## | | | | | | | Cholesterol <= 0.257143: Intermediate risk (2.0)
## | | | | | | | Cholesterol > 0.257143: Borderline risk (12.0/1.0)
## | | | | | | isHypertensive > 0
## | | | | | | | Systolic <= 0.081818: Low risk (3.0)
## | | | | | | | Systolic > 0.081818
## | | | | | | | | Age <= 0.538462: Intermediate risk (5.0)
## | | | | | | | | Age > 0.538462: Borderline risk (2.0)
## | | | | isSmoker > 0
## | | | | | isBlack <= 0
## | | | | | | Systolic <= 0.309091: Borderline risk (9.0/1.0)
## | | | | | | Systolic > 0.309091: Intermediate risk (2.0)
## | | | | | isBlack > 0: Intermediate risk (13.0/1.0)
## | | | isDiabetic > 0
## | | | | Age <= 0.410256
## | | | | | Cholesterol <= 0.357143: Intermediate risk (8.0/1.0)
## | | | | | Cholesterol > 0.357143
## | | | | | | isHypertensive <= 0: Borderline risk (17.0/1.0)
## | | | | | | isHypertensive > 0
## | | | | | | | Cholesterol <= 0.685714
## | | | | | | | | isSmoker <= 0: Borderline risk (3.0)
## | | | | | | | | isSmoker > 0: High risk (2.0)
## | | | | | | | Cholesterol > 0.685714: Intermediate risk (4.0)
## | | | | Age > 0.410256
## | | | | | Cholesterol <= 0.271429: Low risk (3.0/1.0)
## | | | | | Cholesterol > 0.271429: Intermediate risk (12.0/1.0)
## | | Systolic > 0.545455
## | | | Cholesterol <= 0.014286: Borderline risk (5.0/1.0)
## | | | Cholesterol > 0.014286
## | | | | isSmoker <= 0
## | | | | | isDiabetic <= 0: Intermediate risk (12.0/1.0)
## | | | | | isDiabetic > 0
## | | | | | | isMale <= 0: Intermediate risk (4.0/1.0)
## | | | | | | isMale > 0: High risk (5.0)
## | | | | isSmoker > 0
## | | | | | isDiabetic <= 0
## | | | | | | Cholesterol <= 0.242857: Intermediate risk (2.0)
## | | | | | | Cholesterol > 0.242857: High risk (13.0/2.0)
## | | | | | isDiabetic > 0
## | | | | | | HDL <= 0.2: High risk (17.0)
## | | | | | | HDL > 0.2: Intermediate risk (3.0/1.0)
## | HDL > 0.225
## | | Age <= 0.282051
## | | | Systolic <= 0.163636
## | | | | isBlack <= 0: Low risk (44.0)
## | | | | isBlack > 0
## | | | | | isMale <= 0: Low risk (9.0)
## | | | | | isMale > 0
## | | | | | | Systolic <= 0.090909: Low risk (6.0)
## | | | | | | Systolic > 0.090909: Intermediate risk (4.0)
## | | | Systolic > 0.163636
## | | | | isBlack <= 0
## | | | | | Cholesterol <= 0.242857: Low risk (38.0/1.0)
## | | | | | Cholesterol > 0.242857
## | | | | | | HDL <= 0.8125
## | | | | | | | isSmoker <= 0
## | | | | | | | | Age <= 0.230769: Low risk (31.0)
## | | | | | | | | Age > 0.230769
## | | | | | | | | | isMale <= 0: Low risk (2.0)
## | | | | | | | | | isMale > 0: Borderline risk (12.0)
## | | | | | | | isSmoker > 0
## | | | | | | | | Systolic <= 0.309091
## | | | | | | | | | Systolic <= 0.218182: Borderline risk (3.0)
## | | | | | | | | | Systolic > 0.218182: Low risk (9.0)
## | | | | | | | | Systolic > 0.309091
## | | | | | | | | | isMale <= 0
## | | | | | | | | | | isHypertensive <= 0: Borderline risk (17.0/1.0)
## | | | | | | | | | | isHypertensive > 0: Low risk (4.0)
## | | | | | | | | | isMale > 0
## | | | | | | | | | | Systolic <= 0.9
## | | | | | | | | | | | HDL <= 0.4625
## | | | | | | | | | | | | isDiabetic <= 0: Borderline risk (8.0/1.0)
## | | | | | | | | | | | | isDiabetic > 0: Intermediate risk (2.0)
## | | | | | | | | | | | HDL > 0.4625: Borderline risk (23.0)
## | | | | | | | | | | Systolic > 0.9: Intermediate risk (2.0)
## | | | | | | HDL > 0.8125
## | | | | | | | isMale <= 0: Low risk (17.0)
## | | | | | | | isMale > 0
## | | | | | | | | Age <= 0.076923: Low risk (3.0)
## | | | | | | | | Age > 0.076923: Intermediate risk (2.0)
## | | | | isBlack > 0
## | | | | | isDiabetic <= 0
## | | | | | | Systolic <= 0.554545
## | | | | | | | Systolic <= 0.245455
## | | | | | | | | isMale <= 0: Low risk (2.0)
## | | | | | | | | isMale > 0: Borderline risk (15.0/1.0)
## | | | | | | | Systolic > 0.245455
## | | | | | | | | isHypertensive <= 0: Low risk (20.0/2.0)
## | | | | | | | | isHypertensive > 0
## | | | | | | | | | HDL <= 0.4625: Borderline risk (5.0/1.0)
## | | | | | | | | | HDL > 0.4625: Low risk (5.0)
## | | | | | | Systolic > 0.554545
## | | | | | | | isSmoker <= 0
## | | | | | | | | isMale <= 0
## | | | | | | | | | Age <= 0.153846: Borderline risk (5.0/1.0)
## | | | | | | | | | Age > 0.153846: Low risk (6.0)
## | | | | | | | | isMale > 0
## | | | | | | | | | Cholesterol <= 0.7
## | | | | | | | | | | Systolic <= 0.718182: Borderline risk (3.0)
## | | | | | | | | | | Systolic > 0.718182: Intermediate risk (2.0)
## | | | | | | | | | Cholesterol > 0.7: Borderline risk (10.0)
## | | | | | | | isSmoker > 0
## | | | | | | | | Age <= 0: Borderline risk (5.0/1.0)
## | | | | | | | | Age > 0
## | | | | | | | | | Cholesterol <= 0.071429: Low risk (2.0)
## | | | | | | | | | Cholesterol > 0.071429
## | | | | | | | | | | Cholesterol <= 0.871429: Intermediate risk (12.0)
## | | | | | | | | | | Cholesterol > 0.871429: High risk (3.0/1.0)
## | | | | | isDiabetic > 0
## | | | | | | Systolic <= 0.309091
## | | | | | | | HDL <= 0.4375
## | | | | | | | | Age <= 0.205128: Intermediate risk (2.0)
## | | | | | | | | Age > 0.205128: Borderline risk (3.0)
## | | | | | | | HDL > 0.4375: Low risk (6.0)
## | | | | | | Systolic > 0.309091
## | | | | | | | isHypertensive <= 0
## | | | | | | | | Cholesterol <= 0.314286: Borderline risk (5.0/1.0)
## | | | | | | | | Cholesterol > 0.314286: Intermediate risk (10.0/2.0)
## | | | | | | | isHypertensive > 0
## | | | | | | | | Age <= 0.153846
## | | | | | | | | | Systolic <= 0.881818: Intermediate risk (9.0/1.0)
## | | | | | | | | | Systolic > 0.881818: High risk (2.0)
## | | | | | | | | Age > 0.153846: High risk (6.0)
## | | Age > 0.282051
## | | | Systolic <= 0.254545
## | | | | isDiabetic <= 0
## | | | | | isHypertensive <= 0: Low risk (20.0)
## | | | | | isHypertensive > 0
## | | | | | | isMale <= 0
## | | | | | | | Cholesterol <= 0.385714: Low risk (5.0)
## | | | | | | | Cholesterol > 0.385714: Borderline risk (15.0/1.0)
## | | | | | | isMale > 0: Intermediate risk (6.0/1.0)
## | | | | isDiabetic > 0
## | | | | | Age <= 0.435897
## | | | | | | isHypertensive <= 0
## | | | | | | | Systolic <= 0.2: Borderline risk (21.0/1.0)
## | | | | | | | Systolic > 0.2: Low risk (3.0/1.0)
## | | | | | | isHypertensive > 0: Low risk (3.0)
## | | | | | Age > 0.435897: Intermediate risk (10.0)
## | | | Systolic > 0.254545
## | | | | isMale <= 0
## | | | | | isDiabetic <= 0
## | | | | | | Age <= 0.384615
## | | | | | | | HDL <= 0.5125: Intermediate risk (2.0)
## | | | | | | | HDL > 0.5125: Low risk (12.0)
## | | | | | | Age > 0.384615
## | | | | | | | Cholesterol <= 0.814286
## | | | | | | | | Systolic <= 0.936364
## | | | | | | | | | Cholesterol <= 0.414286
## | | | | | | | | | | isSmoker <= 0
## | | | | | | | | | | | Age <= 0.512821: Low risk (2.0)
## | | | | | | | | | | | Age > 0.512821: Borderline risk (8.0)
## | | | | | | | | | | isSmoker > 0: Borderline risk (7.0)
## | | | | | | | | | Cholesterol > 0.414286
## | | | | | | | | | | Age <= 0.512821: Borderline risk (5.0)
## | | | | | | | | | | Age > 0.512821: Intermediate risk (3.0)
## | | | | | | | | Systolic > 0.936364: Intermediate risk (2.0)
## | | | | | | | Cholesterol > 0.814286: Low risk (3.0/1.0)
## | | | | | isDiabetic > 0
## | | | | | | isHypertensive <= 0
## | | | | | | | Systolic <= 0.609091
## | | | | | | | | isBlack <= 0
## | | | | | | | | | Age <= 0.333333: Low risk (2.0)
## | | | | | | | | | Age > 0.333333: Borderline risk (13.0)
## | | | | | | | | isBlack > 0
## | | | | | | | | | isSmoker <= 0: Borderline risk (4.0)
## | | | | | | | | | isSmoker > 0: Intermediate risk (3.0)
## | | | | | | | Systolic > 0.609091
## | | | | | | | | isBlack <= 0: Intermediate risk (6.0/1.0)
## | | | | | | | | isBlack > 0: High risk (2.0)
## | | | | | | isHypertensive > 0
## | | | | | | | Systolic <= 0.827273
## | | | | | | | | isBlack <= 0: Intermediate risk (9.0)
## | | | | | | | | isBlack > 0
## | | | | | | | | | Cholesterol <= 0.814286: Intermediate risk (7.0)
## | | | | | | | | | Cholesterol > 0.814286: High risk (2.0)
## | | | | | | | Systolic > 0.827273: High risk (3.0)
## | | | | isMale > 0
## | | | | | Cholesterol <= 0.914286
## | | | | | | isSmoker <= 0
## | | | | | | | isDiabetic <= 0: Intermediate risk (18.0)
## | | | | | | | isDiabetic > 0
## | | | | | | | | isHypertensive <= 0: Intermediate risk (6.0/1.0)
## | | | | | | | | isHypertensive > 0
## | | | | | | | | | Age <= 0.435897: Borderline risk (4.0)
## | | | | | | | | | Age > 0.435897: Intermediate risk (2.0)
## | | | | | | isSmoker > 0
## | | | | | | | isDiabetic <= 0
## | | | | | | | | isHypertensive <= 0: Intermediate risk (7.0)
## | | | | | | | | isHypertensive > 0
## | | | | | | | | | Systolic <= 0.690909: Intermediate risk (7.0)
## | | | | | | | | | Systolic > 0.690909: High risk (4.0)
## | | | | | | | isDiabetic > 0
## | | | | | | | | Cholesterol <= 0.128571: Intermediate risk (4.0)
## | | | | | | | | Cholesterol > 0.128571: High risk (11.0/1.0)
## | | | | | Cholesterol > 0.914286
## | | | | | | isHypertensive <= 0: Intermediate risk (2.0)
## | | | | | | isHypertensive > 0: Borderline risk (7.0)
## Age > 0.564103
## | Systolic <= 0.5
## | | isDiabetic <= 0
## | | | HDL <= 0.15
## | | | | Systolic <= 0.190909
## | | | | | isMale <= 0: Low risk (2.0)
## | | | | | isMale > 0: Intermediate risk (2.0)
## | | | | Systolic > 0.190909: High risk (9.0)
## | | | HDL > 0.15
## | | | | Age <= 0.692308
## | | | | | isSmoker <= 0
## | | | | | | Systolic <= 0.172727: Low risk (4.0/1.0)
## | | | | | | Systolic > 0.172727
## | | | | | | | Age <= 0.589744: Intermediate risk (3.0/1.0)
## | | | | | | | Age > 0.589744: Borderline risk (22.0)
## | | | | | isSmoker > 0
## | | | | | | Age <= 0.589744: Borderline risk (3.0)
## | | | | | | Age > 0.589744: Intermediate risk (9.0/1.0)
## | | | | Age > 0.692308
## | | | | | HDL <= 0.975
## | | | | | | Systolic <= 0.427273
## | | | | | | | Cholesterol <= 0.057143
## | | | | | | | | Cholesterol <= 0.028571: Intermediate risk (5.0)
## | | | | | | | | Cholesterol > 0.028571: Borderline risk (4.0)
## | | | | | | | Cholesterol > 0.057143
## | | | | | | | | isSmoker <= 0: Intermediate risk (25.0/2.0)
## | | | | | | | | isSmoker > 0
## | | | | | | | | | Age <= 0.769231: Intermediate risk (4.0)
## | | | | | | | | | Age > 0.769231
## | | | | | | | | | | Systolic <= 0.072727: Intermediate risk (2.0)
## | | | | | | | | | | Systolic > 0.072727: High risk (4.0)
## | | | | | | Systolic > 0.427273: High risk (5.0)
## | | | | | HDL > 0.975: Borderline risk (5.0)
## | | isDiabetic > 0
## | | | isSmoker <= 0
## | | | | Systolic <= 0.318182
## | | | | | Age <= 0.820513: Intermediate risk (18.0/1.0)
## | | | | | Age > 0.820513
## | | | | | | isHypertensive <= 0
## | | | | | | | Age <= 0.948718: Intermediate risk (4.0)
## | | | | | | | Age > 0.948718: High risk (4.0/1.0)
## | | | | | | isHypertensive > 0: High risk (3.0)
## | | | | Systolic > 0.318182: High risk (10.0/1.0)
## | | | isSmoker > 0
## | | | | isHypertensive <= 0
## | | | | | isBlack <= 0
## | | | | | | Age <= 0.794872: Intermediate risk (4.0)
## | | | | | | Age > 0.794872: High risk (2.0)
## | | | | | isBlack > 0: High risk (4.0)
## | | | | isHypertensive > 0: High risk (28.0)
## | Systolic > 0.5
## | | Age <= 0.589744
## | | | isDiabetic <= 0: Borderline risk (4.0/1.0)
## | | | isDiabetic > 0: High risk (7.0/1.0)
## | | Age > 0.589744: High risk (141.0/7.0)
##
## Number of Leaves : 131
##
## Size of the tree : 261
# Make predictions using the J48 model on the test data
testPred <- predict(C45Fit, newdata = testData)
# Display the confusion matrix
print(conf_matrix)
##
## testPred Low risk Borderline risk Intermediate risk High risk
## Low risk 126 4 5 1
## Borderline risk 16 165 24 12
## Intermediate risk 6 0 87 27
## High risk 3 7 31 115
# Calculate performance metrics
accuracy_G2 <- sum((diag(conf_matrix)) / sum(conf_matrix))
error_rate_G2 <-( 1 - accuracy_G2)
sensitivity_G2 <- conf_matrix[4, 4] / sum(conf_matrix[4, ])
specificity_G2 <- sum(diag(conf_matrix[-4, -4])) / sum(conf_matrix[-4, ])
precision_G2 <- conf_matrix[4, 4] / sum(conf_matrix[, 4])
# Display performance metrics
cat("Accuracy: ", accuracy_G2, "\n")
## Accuracy: 0.7837838
cat("Error Rate: ", error_rate_G2, "\n")
## Error Rate: 0.2162162
cat("Sensitivity (Recall): ", sensitivity_G2, "\n")
## Sensitivity (Recall): 0.7371795
cat("Specificity: ", specificity_G2, "\n")
## Specificity: 0.7991543
cat("Precision: ", precision_G2, "\n")
## Precision: 0.7419355
Analysis:
The C4.5 decision tree, employing the gain ratio criterion, exhibits strong predictive accuracy with an impressive 78.39%. Characterized by 261 nodes and 131 leaves, the tree’s depth allows it to capture intricate patterns within the data. Notably, the model strikes a balance between sensitivity (73%) and specificity (79.78%), showcasing its ability to effectively identify positive and negative instances. With a precision of 74.1%, the model demonstrates accuracy in positive predictions.
Age is a critical initial factor. Individuals younger than the
threshold value are further assessed for HDL and systolic blood pressure
levels.with a threshold of
0.564103.
DL cholesterol and Systolic blood pressure are critical secondary predictors, stratifying patients into risk categories from low to high.
The highest risk category is assigned to older individuals
(Age > 0.589744) with high
Systolic blood pressure, indicating that
age and blood pressure are critical factors in predicting high ASCVD
risk.
3-partition the data into ( 80% training, 20% testing):
set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.80 , 0.20))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
# Define the formula
myFormula <- Risk ~ .
# Print a summary of the J48 model
print(C45Fit)
## J48 pruned tree
## ------------------
##
## Age <= 0.564103
## | HDL <= 0.225
## | | Systolic <= 0.545455
## | | | isDiabetic <= 0
## | | | | isSmoker <= 0
## | | | | | Age <= 0.25641: Low risk (13.0)
## | | | | | Age > 0.25641
## | | | | | | isHypertensive <= 0
## | | | | | | | Cholesterol <= 0.257143: Intermediate risk (2.0)
## | | | | | | | Cholesterol > 0.257143: Borderline risk (12.0/1.0)
## | | | | | | isHypertensive > 0
## | | | | | | | Systolic <= 0.081818: Low risk (3.0)
## | | | | | | | Systolic > 0.081818
## | | | | | | | | Age <= 0.538462: Intermediate risk (5.0)
## | | | | | | | | Age > 0.538462: Borderline risk (2.0)
## | | | | isSmoker > 0
## | | | | | isBlack <= 0
## | | | | | | Systolic <= 0.309091: Borderline risk (9.0/1.0)
## | | | | | | Systolic > 0.309091: Intermediate risk (2.0)
## | | | | | isBlack > 0: Intermediate risk (13.0/1.0)
## | | | isDiabetic > 0
## | | | | Age <= 0.410256
## | | | | | Cholesterol <= 0.357143: Intermediate risk (8.0/1.0)
## | | | | | Cholesterol > 0.357143
## | | | | | | isHypertensive <= 0: Borderline risk (17.0/1.0)
## | | | | | | isHypertensive > 0
## | | | | | | | Cholesterol <= 0.685714
## | | | | | | | | isSmoker <= 0: Borderline risk (3.0)
## | | | | | | | | isSmoker > 0: High risk (2.0)
## | | | | | | | Cholesterol > 0.685714: Intermediate risk (4.0)
## | | | | Age > 0.410256
## | | | | | Cholesterol <= 0.271429: Low risk (3.0/1.0)
## | | | | | Cholesterol > 0.271429: Intermediate risk (12.0/1.0)
## | | Systolic > 0.545455
## | | | Cholesterol <= 0.014286: Borderline risk (5.0/1.0)
## | | | Cholesterol > 0.014286
## | | | | isSmoker <= 0
## | | | | | isDiabetic <= 0: Intermediate risk (12.0/1.0)
## | | | | | isDiabetic > 0
## | | | | | | isMale <= 0: Intermediate risk (4.0/1.0)
## | | | | | | isMale > 0: High risk (5.0)
## | | | | isSmoker > 0
## | | | | | isDiabetic <= 0
## | | | | | | Cholesterol <= 0.242857: Intermediate risk (2.0)
## | | | | | | Cholesterol > 0.242857: High risk (13.0/2.0)
## | | | | | isDiabetic > 0
## | | | | | | HDL <= 0.2: High risk (17.0)
## | | | | | | HDL > 0.2: Intermediate risk (3.0/1.0)
## | HDL > 0.225
## | | Age <= 0.282051
## | | | Systolic <= 0.163636
## | | | | isBlack <= 0: Low risk (44.0)
## | | | | isBlack > 0
## | | | | | isMale <= 0: Low risk (9.0)
## | | | | | isMale > 0
## | | | | | | Systolic <= 0.090909: Low risk (6.0)
## | | | | | | Systolic > 0.090909: Intermediate risk (4.0)
## | | | Systolic > 0.163636
## | | | | isBlack <= 0
## | | | | | Cholesterol <= 0.242857: Low risk (38.0/1.0)
## | | | | | Cholesterol > 0.242857
## | | | | | | HDL <= 0.8125
## | | | | | | | isSmoker <= 0
## | | | | | | | | Age <= 0.230769: Low risk (31.0)
## | | | | | | | | Age > 0.230769
## | | | | | | | | | isMale <= 0: Low risk (2.0)
## | | | | | | | | | isMale > 0: Borderline risk (12.0)
## | | | | | | | isSmoker > 0
## | | | | | | | | Systolic <= 0.309091
## | | | | | | | | | Systolic <= 0.218182: Borderline risk (3.0)
## | | | | | | | | | Systolic > 0.218182: Low risk (9.0)
## | | | | | | | | Systolic > 0.309091
## | | | | | | | | | isMale <= 0
## | | | | | | | | | | isHypertensive <= 0: Borderline risk (17.0/1.0)
## | | | | | | | | | | isHypertensive > 0: Low risk (4.0)
## | | | | | | | | | isMale > 0
## | | | | | | | | | | Systolic <= 0.9
## | | | | | | | | | | | HDL <= 0.4625
## | | | | | | | | | | | | isDiabetic <= 0: Borderline risk (8.0/1.0)
## | | | | | | | | | | | | isDiabetic > 0: Intermediate risk (2.0)
## | | | | | | | | | | | HDL > 0.4625: Borderline risk (23.0)
## | | | | | | | | | | Systolic > 0.9: Intermediate risk (2.0)
## | | | | | | HDL > 0.8125
## | | | | | | | isMale <= 0: Low risk (17.0)
## | | | | | | | isMale > 0
## | | | | | | | | Age <= 0.076923: Low risk (3.0)
## | | | | | | | | Age > 0.076923: Intermediate risk (2.0)
## | | | | isBlack > 0
## | | | | | isDiabetic <= 0
## | | | | | | Systolic <= 0.554545
## | | | | | | | Systolic <= 0.245455
## | | | | | | | | isMale <= 0: Low risk (2.0)
## | | | | | | | | isMale > 0: Borderline risk (15.0/1.0)
## | | | | | | | Systolic > 0.245455
## | | | | | | | | isHypertensive <= 0: Low risk (20.0/2.0)
## | | | | | | | | isHypertensive > 0
## | | | | | | | | | HDL <= 0.4625: Borderline risk (5.0/1.0)
## | | | | | | | | | HDL > 0.4625: Low risk (5.0)
## | | | | | | Systolic > 0.554545
## | | | | | | | isSmoker <= 0
## | | | | | | | | isMale <= 0
## | | | | | | | | | Age <= 0.153846: Borderline risk (5.0/1.0)
## | | | | | | | | | Age > 0.153846: Low risk (6.0)
## | | | | | | | | isMale > 0
## | | | | | | | | | Cholesterol <= 0.7
## | | | | | | | | | | Systolic <= 0.718182: Borderline risk (3.0)
## | | | | | | | | | | Systolic > 0.718182: Intermediate risk (2.0)
## | | | | | | | | | Cholesterol > 0.7: Borderline risk (10.0)
## | | | | | | | isSmoker > 0
## | | | | | | | | Age <= 0: Borderline risk (5.0/1.0)
## | | | | | | | | Age > 0
## | | | | | | | | | Cholesterol <= 0.071429: Low risk (2.0)
## | | | | | | | | | Cholesterol > 0.071429
## | | | | | | | | | | Cholesterol <= 0.871429: Intermediate risk (12.0)
## | | | | | | | | | | Cholesterol > 0.871429: High risk (3.0/1.0)
## | | | | | isDiabetic > 0
## | | | | | | Systolic <= 0.309091
## | | | | | | | HDL <= 0.4375
## | | | | | | | | Age <= 0.205128: Intermediate risk (2.0)
## | | | | | | | | Age > 0.205128: Borderline risk (3.0)
## | | | | | | | HDL > 0.4375: Low risk (6.0)
## | | | | | | Systolic > 0.309091
## | | | | | | | isHypertensive <= 0
## | | | | | | | | Cholesterol <= 0.314286: Borderline risk (5.0/1.0)
## | | | | | | | | Cholesterol > 0.314286: Intermediate risk (10.0/2.0)
## | | | | | | | isHypertensive > 0
## | | | | | | | | Age <= 0.153846
## | | | | | | | | | Systolic <= 0.881818: Intermediate risk (9.0/1.0)
## | | | | | | | | | Systolic > 0.881818: High risk (2.0)
## | | | | | | | | Age > 0.153846: High risk (6.0)
## | | Age > 0.282051
## | | | Systolic <= 0.254545
## | | | | isDiabetic <= 0
## | | | | | isHypertensive <= 0: Low risk (20.0)
## | | | | | isHypertensive > 0
## | | | | | | isMale <= 0
## | | | | | | | Cholesterol <= 0.385714: Low risk (5.0)
## | | | | | | | Cholesterol > 0.385714: Borderline risk (15.0/1.0)
## | | | | | | isMale > 0: Intermediate risk (6.0/1.0)
## | | | | isDiabetic > 0
## | | | | | Age <= 0.435897
## | | | | | | isHypertensive <= 0
## | | | | | | | Systolic <= 0.2: Borderline risk (21.0/1.0)
## | | | | | | | Systolic > 0.2: Low risk (3.0/1.0)
## | | | | | | isHypertensive > 0: Low risk (3.0)
## | | | | | Age > 0.435897: Intermediate risk (10.0)
## | | | Systolic > 0.254545
## | | | | isMale <= 0
## | | | | | isDiabetic <= 0
## | | | | | | Age <= 0.384615
## | | | | | | | HDL <= 0.5125: Intermediate risk (2.0)
## | | | | | | | HDL > 0.5125: Low risk (12.0)
## | | | | | | Age > 0.384615
## | | | | | | | Cholesterol <= 0.814286
## | | | | | | | | Systolic <= 0.936364
## | | | | | | | | | Cholesterol <= 0.414286
## | | | | | | | | | | isSmoker <= 0
## | | | | | | | | | | | Age <= 0.512821: Low risk (2.0)
## | | | | | | | | | | | Age > 0.512821: Borderline risk (8.0)
## | | | | | | | | | | isSmoker > 0: Borderline risk (7.0)
## | | | | | | | | | Cholesterol > 0.414286
## | | | | | | | | | | Age <= 0.512821: Borderline risk (5.0)
## | | | | | | | | | | Age > 0.512821: Intermediate risk (3.0)
## | | | | | | | | Systolic > 0.936364: Intermediate risk (2.0)
## | | | | | | | Cholesterol > 0.814286: Low risk (3.0/1.0)
## | | | | | isDiabetic > 0
## | | | | | | isHypertensive <= 0
## | | | | | | | Systolic <= 0.609091
## | | | | | | | | isBlack <= 0
## | | | | | | | | | Age <= 0.333333: Low risk (2.0)
## | | | | | | | | | Age > 0.333333: Borderline risk (13.0)
## | | | | | | | | isBlack > 0
## | | | | | | | | | isSmoker <= 0: Borderline risk (4.0)
## | | | | | | | | | isSmoker > 0: Intermediate risk (3.0)
## | | | | | | | Systolic > 0.609091
## | | | | | | | | isBlack <= 0: Intermediate risk (6.0/1.0)
## | | | | | | | | isBlack > 0: High risk (2.0)
## | | | | | | isHypertensive > 0
## | | | | | | | Systolic <= 0.827273
## | | | | | | | | isBlack <= 0: Intermediate risk (9.0)
## | | | | | | | | isBlack > 0
## | | | | | | | | | Cholesterol <= 0.814286: Intermediate risk (7.0)
## | | | | | | | | | Cholesterol > 0.814286: High risk (2.0)
## | | | | | | | Systolic > 0.827273: High risk (3.0)
## | | | | isMale > 0
## | | | | | Cholesterol <= 0.914286
## | | | | | | isSmoker <= 0
## | | | | | | | isDiabetic <= 0: Intermediate risk (18.0)
## | | | | | | | isDiabetic > 0
## | | | | | | | | isHypertensive <= 0: Intermediate risk (6.0/1.0)
## | | | | | | | | isHypertensive > 0
## | | | | | | | | | Age <= 0.435897: Borderline risk (4.0)
## | | | | | | | | | Age > 0.435897: Intermediate risk (2.0)
## | | | | | | isSmoker > 0
## | | | | | | | isDiabetic <= 0
## | | | | | | | | isHypertensive <= 0: Intermediate risk (7.0)
## | | | | | | | | isHypertensive > 0
## | | | | | | | | | Systolic <= 0.690909: Intermediate risk (7.0)
## | | | | | | | | | Systolic > 0.690909: High risk (4.0)
## | | | | | | | isDiabetic > 0
## | | | | | | | | Cholesterol <= 0.128571: Intermediate risk (4.0)
## | | | | | | | | Cholesterol > 0.128571: High risk (11.0/1.0)
## | | | | | Cholesterol > 0.914286
## | | | | | | isHypertensive <= 0: Intermediate risk (2.0)
## | | | | | | isHypertensive > 0: Borderline risk (7.0)
## Age > 0.564103
## | Systolic <= 0.5
## | | isDiabetic <= 0
## | | | HDL <= 0.15
## | | | | Systolic <= 0.190909
## | | | | | isMale <= 0: Low risk (2.0)
## | | | | | isMale > 0: Intermediate risk (2.0)
## | | | | Systolic > 0.190909: High risk (9.0)
## | | | HDL > 0.15
## | | | | Age <= 0.692308
## | | | | | isSmoker <= 0
## | | | | | | Systolic <= 0.172727: Low risk (4.0/1.0)
## | | | | | | Systolic > 0.172727
## | | | | | | | Age <= 0.589744: Intermediate risk (3.0/1.0)
## | | | | | | | Age > 0.589744: Borderline risk (22.0)
## | | | | | isSmoker > 0
## | | | | | | Age <= 0.589744: Borderline risk (3.0)
## | | | | | | Age > 0.589744: Intermediate risk (9.0/1.0)
## | | | | Age > 0.692308
## | | | | | HDL <= 0.975
## | | | | | | Systolic <= 0.427273
## | | | | | | | Cholesterol <= 0.057143
## | | | | | | | | Cholesterol <= 0.028571: Intermediate risk (5.0)
## | | | | | | | | Cholesterol > 0.028571: Borderline risk (4.0)
## | | | | | | | Cholesterol > 0.057143
## | | | | | | | | isSmoker <= 0: Intermediate risk (25.0/2.0)
## | | | | | | | | isSmoker > 0
## | | | | | | | | | Age <= 0.769231: Intermediate risk (4.0)
## | | | | | | | | | Age > 0.769231
## | | | | | | | | | | Systolic <= 0.072727: Intermediate risk (2.0)
## | | | | | | | | | | Systolic > 0.072727: High risk (4.0)
## | | | | | | Systolic > 0.427273: High risk (5.0)
## | | | | | HDL > 0.975: Borderline risk (5.0)
## | | isDiabetic > 0
## | | | isSmoker <= 0
## | | | | Systolic <= 0.318182
## | | | | | Age <= 0.820513: Intermediate risk (18.0/1.0)
## | | | | | Age > 0.820513
## | | | | | | isHypertensive <= 0
## | | | | | | | Age <= 0.948718: Intermediate risk (4.0)
## | | | | | | | Age > 0.948718: High risk (4.0/1.0)
## | | | | | | isHypertensive > 0: High risk (3.0)
## | | | | Systolic > 0.318182: High risk (10.0/1.0)
## | | | isSmoker > 0
## | | | | isHypertensive <= 0
## | | | | | isBlack <= 0
## | | | | | | Age <= 0.794872: Intermediate risk (4.0)
## | | | | | | Age > 0.794872: High risk (2.0)
## | | | | | isBlack > 0: High risk (4.0)
## | | | | isHypertensive > 0: High risk (28.0)
## | Systolic > 0.5
## | | Age <= 0.589744
## | | | isDiabetic <= 0: Borderline risk (4.0/1.0)
## | | | isDiabetic > 0: High risk (7.0/1.0)
## | | Age > 0.589744: High risk (141.0/7.0)
##
## Number of Leaves : 131
##
## Size of the tree : 261
# Make predictions using the J48 model on the test data
testPred <- predict(C45Fit, newdata = testData)
# Create a confusion matrix
conf_matrix <- table(testPred, testData$Risk)
# Display the confusion matrix
print(conf_matrix)
##
## testPred Low risk Borderline risk Intermediate risk High risk
## Low risk 58 2 8 2
## Borderline risk 8 90 6 2
## Intermediate risk 9 0 42 17
## High risk 0 0 17 55
# Calculate performance metrics
accuracy_G3 <- sum((diag(conf_matrix)) / sum(conf_matrix))
error_rate_G3 <-( 1 - accuracy_G3)
sensitivity_G3 <- conf_matrix[4, 4] / sum(conf_matrix[4, ])
specificity_G3 <- sum(diag(conf_matrix[-4, -4])) / sum(conf_matrix[-4, ])
precision_G3 <- conf_matrix[4, 4] / sum(conf_matrix[, 4])
accuracy <- sum(testPred == testData$Risk) / length(testPred)
# Display performance metrics
cat("Accuracy: ", accuracy_G3, "\n")
## Accuracy: 0.7753165
cat("Error Rate: ", error_rate_G3, "\n")
## Error Rate: 0.2246835
cat("Sensitivity (Recall): ", sensitivity_G3, "\n")
## Sensitivity (Recall): 0.7638889
cat("Specificity: ", specificity_G3, "\n")
## Specificity: 0.7786885
cat("Precision: ", precision_G3, "\n")
## Precision: 0.7236842
Analysis:
The C4.5 decision tree, employing the gain ratio criterion, demonstrates a commendable accuracy of 81.%. With a substantial tree size of 305 and 153 leaves, the model captures nuanced relationships within the dataset. Its predictive prowess is evident in the balanced sensitivity (78.26%) and specificity (81.78%), highlighting its ability to correctly identify both positive and negative instances. The precision of 71.05% emphasizes the accuracy of positive predictions. This collectively positions the C4.5 decision tree as a robust and effective choice for classification on our dataset, showcasing its capability to achieve high accuracy and reliable predictions.
# Create data frames for each model's summary
summary_c4.5_1 <- data.frame(
Model = "60% training, 40% testing",
Accuracy = 78.38,
Sensitivity = 73.72,
Specificity = 79.92,
Precision = 74.19
)
summary_c4.5_2 <- data.frame(
Model = "70% training, 30% testing",
Accuracy = 79.39,
Sensitivity = 78.0,
Specificity = 79.78,
Precision = 72.90
)
summary_c4.5_3 <- data.frame(
Model = "80% training, 20% testing",
Accuracy = 81.01,
Sensitivity = 78.26,
Specificity = 81.78,
Precision = 71.05
)
# Combine the summaries into a single data frame
comparison_table <- rbind(summary_c4.5_1, summary_c4.5_2, summary_c4.5_3)
# Print the comparison table
print(comparison_table)
## Model Accuracy Sensitivity Specificity Precision
## 1 60% training, 40% testing 78.38 73.72 79.92 74.19
## 2 70% training, 30% testing 79.39 78.00 79.78 72.90
## 3 80% training, 20% testing 81.01 78.26 81.78 71.05
In our exploration of decision tree models—specifically, C4.5 with varying numbers of training-testing —we aimed to identify the optimal configuration for accurate and reliable predictions. The results indicate that the model with (80% training, 20% testing) stands out, achieving the highest accuracy at 81.01%. This particular configuration strikes a balance between sensitivity (78.26%), specificity (81.78%), and precision (71.05%), making it a robust choice for the classification task at hand.
It’s noteworthy that the model with (70% training, 30% testing) also performs well, showcasing competitive accuracy (79.39%) and a balanced trade-off between sensitivity and specificity. However, the model with (60% training, 40% testing) surpasses it, demonstrating superior sensitivity and precision.
In contrast, the model with (60% training, 40% testing), while achieving a respectable accuracy of 78.38%, exhibits slightly lower sensitivity and precision. This suggests that a more complex tree structure, as seen in the model with (80% training, 20% testing), contributes to better capturing the underlying patterns in the data.
In conclusion, the C4.5 decision tree with (80% training, 20% testing) emerges as the preferred model for this specific dataset and classification task. Its superior performance in terms of accuracy, sensitivity, specificity, and precision underscores its suitability for making reliable predictions.
-For the construction of our decision tree model, we have opted for the C5.0 algorithm, a sophisticated and versatile tool known for its proficiency in handling classification tasks. Specifically, we harness the power of information gain as the guiding criterion within C5.0. This choice is deliberate, as information gain allows the algorithm to discern the most relevant and discriminative features in our dataset, facilitating the creation of a decision tree that excels at capturing intricate patterns and relationships.
1-partition the data into ( 60% training, 40% testing):
set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.60 , 0.40))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
# install.packages("C50")
library(C50)
# Define the formula
myFormula <- Risk ~ .
# Build the C5.0 decision tree on the training data with information gain
c50_model <- C5.0(myFormula, data = trainData)
# Display a summary of the decision tree
print(c50_model)
##
## Call:
## C5.0.formula(formula = myFormula, data = trainData)
##
## Classification Tree
## Number of samples: 959
## Number of predictors: 9
##
## Tree size: 105
##
## Non-standard options: attempt to group attributes
# Make predictions using the C5.0 model on the test data
testPred <- predict(c50_model, newdata = testData)
print(C45Fit)
## J48 pruned tree
## ------------------
##
## Age <= 0.564103
## | HDL <= 0.225
## | | Systolic <= 0.545455
## | | | isDiabetic <= 0
## | | | | isSmoker <= 0
## | | | | | Age <= 0.25641: Low risk (13.0)
## | | | | | Age > 0.25641
## | | | | | | isHypertensive <= 0
## | | | | | | | Cholesterol <= 0.257143: Intermediate risk (2.0)
## | | | | | | | Cholesterol > 0.257143: Borderline risk (12.0/1.0)
## | | | | | | isHypertensive > 0
## | | | | | | | Systolic <= 0.081818: Low risk (3.0)
## | | | | | | | Systolic > 0.081818
## | | | | | | | | Age <= 0.538462: Intermediate risk (5.0)
## | | | | | | | | Age > 0.538462: Borderline risk (2.0)
## | | | | isSmoker > 0
## | | | | | isBlack <= 0
## | | | | | | Systolic <= 0.309091: Borderline risk (9.0/1.0)
## | | | | | | Systolic > 0.309091: Intermediate risk (2.0)
## | | | | | isBlack > 0: Intermediate risk (13.0/1.0)
## | | | isDiabetic > 0
## | | | | Age <= 0.410256
## | | | | | Cholesterol <= 0.357143: Intermediate risk (8.0/1.0)
## | | | | | Cholesterol > 0.357143
## | | | | | | isHypertensive <= 0: Borderline risk (17.0/1.0)
## | | | | | | isHypertensive > 0
## | | | | | | | Cholesterol <= 0.685714
## | | | | | | | | isSmoker <= 0: Borderline risk (3.0)
## | | | | | | | | isSmoker > 0: High risk (2.0)
## | | | | | | | Cholesterol > 0.685714: Intermediate risk (4.0)
## | | | | Age > 0.410256
## | | | | | Cholesterol <= 0.271429: Low risk (3.0/1.0)
## | | | | | Cholesterol > 0.271429: Intermediate risk (12.0/1.0)
## | | Systolic > 0.545455
## | | | Cholesterol <= 0.014286: Borderline risk (5.0/1.0)
## | | | Cholesterol > 0.014286
## | | | | isSmoker <= 0
## | | | | | isDiabetic <= 0: Intermediate risk (12.0/1.0)
## | | | | | isDiabetic > 0
## | | | | | | isMale <= 0: Intermediate risk (4.0/1.0)
## | | | | | | isMale > 0: High risk (5.0)
## | | | | isSmoker > 0
## | | | | | isDiabetic <= 0
## | | | | | | Cholesterol <= 0.242857: Intermediate risk (2.0)
## | | | | | | Cholesterol > 0.242857: High risk (13.0/2.0)
## | | | | | isDiabetic > 0
## | | | | | | HDL <= 0.2: High risk (17.0)
## | | | | | | HDL > 0.2: Intermediate risk (3.0/1.0)
## | HDL > 0.225
## | | Age <= 0.282051
## | | | Systolic <= 0.163636
## | | | | isBlack <= 0: Low risk (44.0)
## | | | | isBlack > 0
## | | | | | isMale <= 0: Low risk (9.0)
## | | | | | isMale > 0
## | | | | | | Systolic <= 0.090909: Low risk (6.0)
## | | | | | | Systolic > 0.090909: Intermediate risk (4.0)
## | | | Systolic > 0.163636
## | | | | isBlack <= 0
## | | | | | Cholesterol <= 0.242857: Low risk (38.0/1.0)
## | | | | | Cholesterol > 0.242857
## | | | | | | HDL <= 0.8125
## | | | | | | | isSmoker <= 0
## | | | | | | | | Age <= 0.230769: Low risk (31.0)
## | | | | | | | | Age > 0.230769
## | | | | | | | | | isMale <= 0: Low risk (2.0)
## | | | | | | | | | isMale > 0: Borderline risk (12.0)
## | | | | | | | isSmoker > 0
## | | | | | | | | Systolic <= 0.309091
## | | | | | | | | | Systolic <= 0.218182: Borderline risk (3.0)
## | | | | | | | | | Systolic > 0.218182: Low risk (9.0)
## | | | | | | | | Systolic > 0.309091
## | | | | | | | | | isMale <= 0
## | | | | | | | | | | isHypertensive <= 0: Borderline risk (17.0/1.0)
## | | | | | | | | | | isHypertensive > 0: Low risk (4.0)
## | | | | | | | | | isMale > 0
## | | | | | | | | | | Systolic <= 0.9
## | | | | | | | | | | | HDL <= 0.4625
## | | | | | | | | | | | | isDiabetic <= 0: Borderline risk (8.0/1.0)
## | | | | | | | | | | | | isDiabetic > 0: Intermediate risk (2.0)
## | | | | | | | | | | | HDL > 0.4625: Borderline risk (23.0)
## | | | | | | | | | | Systolic > 0.9: Intermediate risk (2.0)
## | | | | | | HDL > 0.8125
## | | | | | | | isMale <= 0: Low risk (17.0)
## | | | | | | | isMale > 0
## | | | | | | | | Age <= 0.076923: Low risk (3.0)
## | | | | | | | | Age > 0.076923: Intermediate risk (2.0)
## | | | | isBlack > 0
## | | | | | isDiabetic <= 0
## | | | | | | Systolic <= 0.554545
## | | | | | | | Systolic <= 0.245455
## | | | | | | | | isMale <= 0: Low risk (2.0)
## | | | | | | | | isMale > 0: Borderline risk (15.0/1.0)
## | | | | | | | Systolic > 0.245455
## | | | | | | | | isHypertensive <= 0: Low risk (20.0/2.0)
## | | | | | | | | isHypertensive > 0
## | | | | | | | | | HDL <= 0.4625: Borderline risk (5.0/1.0)
## | | | | | | | | | HDL > 0.4625: Low risk (5.0)
## | | | | | | Systolic > 0.554545
## | | | | | | | isSmoker <= 0
## | | | | | | | | isMale <= 0
## | | | | | | | | | Age <= 0.153846: Borderline risk (5.0/1.0)
## | | | | | | | | | Age > 0.153846: Low risk (6.0)
## | | | | | | | | isMale > 0
## | | | | | | | | | Cholesterol <= 0.7
## | | | | | | | | | | Systolic <= 0.718182: Borderline risk (3.0)
## | | | | | | | | | | Systolic > 0.718182: Intermediate risk (2.0)
## | | | | | | | | | Cholesterol > 0.7: Borderline risk (10.0)
## | | | | | | | isSmoker > 0
## | | | | | | | | Age <= 0: Borderline risk (5.0/1.0)
## | | | | | | | | Age > 0
## | | | | | | | | | Cholesterol <= 0.071429: Low risk (2.0)
## | | | | | | | | | Cholesterol > 0.071429
## | | | | | | | | | | Cholesterol <= 0.871429: Intermediate risk (12.0)
## | | | | | | | | | | Cholesterol > 0.871429: High risk (3.0/1.0)
## | | | | | isDiabetic > 0
## | | | | | | Systolic <= 0.309091
## | | | | | | | HDL <= 0.4375
## | | | | | | | | Age <= 0.205128: Intermediate risk (2.0)
## | | | | | | | | Age > 0.205128: Borderline risk (3.0)
## | | | | | | | HDL > 0.4375: Low risk (6.0)
## | | | | | | Systolic > 0.309091
## | | | | | | | isHypertensive <= 0
## | | | | | | | | Cholesterol <= 0.314286: Borderline risk (5.0/1.0)
## | | | | | | | | Cholesterol > 0.314286: Intermediate risk (10.0/2.0)
## | | | | | | | isHypertensive > 0
## | | | | | | | | Age <= 0.153846
## | | | | | | | | | Systolic <= 0.881818: Intermediate risk (9.0/1.0)
## | | | | | | | | | Systolic > 0.881818: High risk (2.0)
## | | | | | | | | Age > 0.153846: High risk (6.0)
## | | Age > 0.282051
## | | | Systolic <= 0.254545
## | | | | isDiabetic <= 0
## | | | | | isHypertensive <= 0: Low risk (20.0)
## | | | | | isHypertensive > 0
## | | | | | | isMale <= 0
## | | | | | | | Cholesterol <= 0.385714: Low risk (5.0)
## | | | | | | | Cholesterol > 0.385714: Borderline risk (15.0/1.0)
## | | | | | | isMale > 0: Intermediate risk (6.0/1.0)
## | | | | isDiabetic > 0
## | | | | | Age <= 0.435897
## | | | | | | isHypertensive <= 0
## | | | | | | | Systolic <= 0.2: Borderline risk (21.0/1.0)
## | | | | | | | Systolic > 0.2: Low risk (3.0/1.0)
## | | | | | | isHypertensive > 0: Low risk (3.0)
## | | | | | Age > 0.435897: Intermediate risk (10.0)
## | | | Systolic > 0.254545
## | | | | isMale <= 0
## | | | | | isDiabetic <= 0
## | | | | | | Age <= 0.384615
## | | | | | | | HDL <= 0.5125: Intermediate risk (2.0)
## | | | | | | | HDL > 0.5125: Low risk (12.0)
## | | | | | | Age > 0.384615
## | | | | | | | Cholesterol <= 0.814286
## | | | | | | | | Systolic <= 0.936364
## | | | | | | | | | Cholesterol <= 0.414286
## | | | | | | | | | | isSmoker <= 0
## | | | | | | | | | | | Age <= 0.512821: Low risk (2.0)
## | | | | | | | | | | | Age > 0.512821: Borderline risk (8.0)
## | | | | | | | | | | isSmoker > 0: Borderline risk (7.0)
## | | | | | | | | | Cholesterol > 0.414286
## | | | | | | | | | | Age <= 0.512821: Borderline risk (5.0)
## | | | | | | | | | | Age > 0.512821: Intermediate risk (3.0)
## | | | | | | | | Systolic > 0.936364: Intermediate risk (2.0)
## | | | | | | | Cholesterol > 0.814286: Low risk (3.0/1.0)
## | | | | | isDiabetic > 0
## | | | | | | isHypertensive <= 0
## | | | | | | | Systolic <= 0.609091
## | | | | | | | | isBlack <= 0
## | | | | | | | | | Age <= 0.333333: Low risk (2.0)
## | | | | | | | | | Age > 0.333333: Borderline risk (13.0)
## | | | | | | | | isBlack > 0
## | | | | | | | | | isSmoker <= 0: Borderline risk (4.0)
## | | | | | | | | | isSmoker > 0: Intermediate risk (3.0)
## | | | | | | | Systolic > 0.609091
## | | | | | | | | isBlack <= 0: Intermediate risk (6.0/1.0)
## | | | | | | | | isBlack > 0: High risk (2.0)
## | | | | | | isHypertensive > 0
## | | | | | | | Systolic <= 0.827273
## | | | | | | | | isBlack <= 0: Intermediate risk (9.0)
## | | | | | | | | isBlack > 0
## | | | | | | | | | Cholesterol <= 0.814286: Intermediate risk (7.0)
## | | | | | | | | | Cholesterol > 0.814286: High risk (2.0)
## | | | | | | | Systolic > 0.827273: High risk (3.0)
## | | | | isMale > 0
## | | | | | Cholesterol <= 0.914286
## | | | | | | isSmoker <= 0
## | | | | | | | isDiabetic <= 0: Intermediate risk (18.0)
## | | | | | | | isDiabetic > 0
## | | | | | | | | isHypertensive <= 0: Intermediate risk (6.0/1.0)
## | | | | | | | | isHypertensive > 0
## | | | | | | | | | Age <= 0.435897: Borderline risk (4.0)
## | | | | | | | | | Age > 0.435897: Intermediate risk (2.0)
## | | | | | | isSmoker > 0
## | | | | | | | isDiabetic <= 0
## | | | | | | | | isHypertensive <= 0: Intermediate risk (7.0)
## | | | | | | | | isHypertensive > 0
## | | | | | | | | | Systolic <= 0.690909: Intermediate risk (7.0)
## | | | | | | | | | Systolic > 0.690909: High risk (4.0)
## | | | | | | | isDiabetic > 0
## | | | | | | | | Cholesterol <= 0.128571: Intermediate risk (4.0)
## | | | | | | | | Cholesterol > 0.128571: High risk (11.0/1.0)
## | | | | | Cholesterol > 0.914286
## | | | | | | isHypertensive <= 0: Intermediate risk (2.0)
## | | | | | | isHypertensive > 0: Borderline risk (7.0)
## Age > 0.564103
## | Systolic <= 0.5
## | | isDiabetic <= 0
## | | | HDL <= 0.15
## | | | | Systolic <= 0.190909
## | | | | | isMale <= 0: Low risk (2.0)
## | | | | | isMale > 0: Intermediate risk (2.0)
## | | | | Systolic > 0.190909: High risk (9.0)
## | | | HDL > 0.15
## | | | | Age <= 0.692308
## | | | | | isSmoker <= 0
## | | | | | | Systolic <= 0.172727: Low risk (4.0/1.0)
## | | | | | | Systolic > 0.172727
## | | | | | | | Age <= 0.589744: Intermediate risk (3.0/1.0)
## | | | | | | | Age > 0.589744: Borderline risk (22.0)
## | | | | | isSmoker > 0
## | | | | | | Age <= 0.589744: Borderline risk (3.0)
## | | | | | | Age > 0.589744: Intermediate risk (9.0/1.0)
## | | | | Age > 0.692308
## | | | | | HDL <= 0.975
## | | | | | | Systolic <= 0.427273
## | | | | | | | Cholesterol <= 0.057143
## | | | | | | | | Cholesterol <= 0.028571: Intermediate risk (5.0)
## | | | | | | | | Cholesterol > 0.028571: Borderline risk (4.0)
## | | | | | | | Cholesterol > 0.057143
## | | | | | | | | isSmoker <= 0: Intermediate risk (25.0/2.0)
## | | | | | | | | isSmoker > 0
## | | | | | | | | | Age <= 0.769231: Intermediate risk (4.0)
## | | | | | | | | | Age > 0.769231
## | | | | | | | | | | Systolic <= 0.072727: Intermediate risk (2.0)
## | | | | | | | | | | Systolic > 0.072727: High risk (4.0)
## | | | | | | Systolic > 0.427273: High risk (5.0)
## | | | | | HDL > 0.975: Borderline risk (5.0)
## | | isDiabetic > 0
## | | | isSmoker <= 0
## | | | | Systolic <= 0.318182
## | | | | | Age <= 0.820513: Intermediate risk (18.0/1.0)
## | | | | | Age > 0.820513
## | | | | | | isHypertensive <= 0
## | | | | | | | Age <= 0.948718: Intermediate risk (4.0)
## | | | | | | | Age > 0.948718: High risk (4.0/1.0)
## | | | | | | isHypertensive > 0: High risk (3.0)
## | | | | Systolic > 0.318182: High risk (10.0/1.0)
## | | | isSmoker > 0
## | | | | isHypertensive <= 0
## | | | | | isBlack <= 0
## | | | | | | Age <= 0.794872: Intermediate risk (4.0)
## | | | | | | Age > 0.794872: High risk (2.0)
## | | | | | isBlack > 0: High risk (4.0)
## | | | | isHypertensive > 0: High risk (28.0)
## | Systolic > 0.5
## | | Age <= 0.589744
## | | | isDiabetic <= 0: Borderline risk (4.0/1.0)
## | | | isDiabetic > 0: High risk (7.0/1.0)
## | | Age > 0.589744: High risk (141.0/7.0)
##
## Number of Leaves : 131
##
## Size of the tree : 261
# Calculate performance metrics
accuracy_I1 <- sum((diag(conf_matrix)) / sum(conf_matrix))
error_rate_I1 <-( 1 - accuracy_I1)
sensitivity_I1 <- conf_matrix[4, 4] / sum(conf_matrix[4, ])
specificity_I1 <- sum(diag(conf_matrix[-4, -4])) / sum(conf_matrix[-4, ])
precision_I1 <- conf_matrix[4, 4] / sum(conf_matrix[, 4])
# Display performance metrics
cat("Accuracy: ", accuracy_I1, "\n")
## Accuracy: 0.7753165
cat("Error Rate: ", error_rate_I1, "\n")
## Error Rate: 0.2246835
cat("Sensitivity (Recall): ", sensitivity_I1, "\n")
## Sensitivity (Recall): 0.7638889
cat("Specificity: ", specificity_I1, "\n")
## Specificity: 0.7786885
cat("Precision: ", precision_I1, "\n")
## Precision: 0.7236842
# Display a summary of the decision tree
summary(c50_model)
##
## Call:
## C5.0.formula(formula = myFormula, data = trainData)
##
##
## C5.0 [Release 2.07 GPL Edition] Sat Dec 2 16:40:38 2023
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 959 cases (10 attributes) from undefined.data
##
## Decision tree:
##
## Age <= 0.5641026:
## :...HDL <= 0.225:
## : :...Systolic > 0.5545455:
## : : :...Cholesterol <= 0.01428571: Borderline risk (5/1)
## : : : Cholesterol > 0.01428571:
## : : : :...isSmoker > 0: High risk (29/6)
## : : : isSmoker <= 0:
## : : : :...Systolic <= 0.6727273: High risk (3)
## : : : Systolic > 0.6727273:
## : : : :...Age <= 0.4615385: Intermediate risk (9)
## : : : Age > 0.4615385: High risk (3/1)
## : : Systolic <= 0.5545455:
## : : :...isHypertensive > 0:
## : : :...isDiabetic > 0:
## : : : :...isMale <= 0: Intermediate risk (11/1)
## : : : : isMale > 0:
## : : : : :...isBlack <= 0: High risk (3/1)
## : : : : isBlack > 0: Intermediate risk (4/1)
## : : : isDiabetic <= 0:
## : : : :...Systolic > 0.3090909: Intermediate risk (10)
## : : : Systolic <= 0.3090909:
## : : : :...Systolic > 0.1909091: Borderline risk (5/1)
## : : : Systolic <= 0.1909091:
## : : : :...isMale <= 0: Low risk (6)
## : : : isMale > 0: Intermediate risk (3)
## : : isHypertensive <= 0:
## : : :...Age <= 0.02564103: Low risk (6)
## : : Age > 0.02564103:
## : : :...HDL <= 0.0125: Intermediate risk (6)
## : : HDL > 0.0125:
## : : :...Cholesterol <= 0.2:
## : : :...Systolic <= 0.2909091: Low risk (3)
## : : : Systolic > 0.2909091: Intermediate risk (3)
## : : Cholesterol > 0.2:
## : : :...Cholesterol > 0.8714285: Low risk (2/1)
## : : Cholesterol <= 0.8714285:
## : : :...Age <= 0.05128205: Intermediate risk (2)
## : : Age > 0.05128205: Borderline risk (30/4)
## : HDL > 0.225:
## : :...Age <= 0.2820513:
## : :...isBlack <= 0:
## : : :...Cholesterol > 0.5571429:
## : : : :...Systolic <= 0.1636364: Low risk (5)
## : : : : Systolic > 0.1636364:
## : : : : :...isSmoker <= 0:
## : : : : :...Age <= 0.2307692: Low risk (15)
## : : : : : Age > 0.2307692: Borderline risk (8/1)
## : : : : isSmoker > 0:
## : : : : :...HDL <= 0.7375: Borderline risk (33/4)
## : : : : HDL > 0.7375: Low risk (8/1)
## : : : Cholesterol <= 0.5571429:
## : : : :...Systolic <= 0.7181818: Low risk (78)
## : : : Systolic > 0.7181818:
## : : : :...isDiabetic <= 0: Low risk (18/1)
## : : : isDiabetic > 0:
## : : : :...HDL > 0.6375: Borderline risk (11)
## : : : HDL <= 0.6375:
## : : : :...Systolic <= 0.9: Low risk (5)
## : : : Systolic > 0.9: Intermediate risk (2)
## : : isBlack > 0:
## : : :...Systolic <= 0.5363637:
## : : :...isMale <= 0:
## : : : :...Cholesterol <= 0.8285714: Low risk (30/1)
## : : : : Cholesterol > 0.8285714: Borderline risk (2)
## : : : isMale > 0:
## : : : :...isDiabetic > 0:
## : : : :...Systolic <= 0.07272727: Low risk (4)
## : : : : Systolic > 0.07272727: Intermediate risk (9/1)
## : : : isDiabetic <= 0:
## : : : :...isSmoker <= 0:
## : : : :...isHypertensive <= 0: Low risk (9)
## : : : : isHypertensive > 0: Borderline risk (6/1)
## : : : isSmoker > 0:
## : : : :...Age <= 0.1794872: Borderline risk (12/1)
## : : : Age > 0.1794872: Intermediate risk (2)
## : : Systolic > 0.5363637:
## : : :...isHypertensive <= 0:
## : : :...Age <= 0.2051282:
## : : : :...isMale <= 0:
## : : : : :...Age <= 0.1282051: Borderline risk (5)
## : : : : : Age > 0.1282051: Low risk (6/1)
## : : : : isMale > 0:
## : : : : :...Cholesterol <= 0.6857143: Intermediate risk (5/1)
## : : : : Cholesterol > 0.6857143: Borderline risk (8)
## : : : Age > 0.2051282:
## : : : :...isSmoker > 0: Intermediate risk (4)
## : : : isSmoker <= 0:
## : : : :...Age <= 0.2564103: Intermediate risk (2)
## : : : Age > 0.2564103: Low risk (2)
## : : isHypertensive > 0:
## : : :...Systolic > 0.8909091: High risk (7)
## : : Systolic <= 0.8909091:
## : : :...Age <= 0.07692308:
## : : :...HDL <= 0.5625: Intermediate risk (3)
## : : : HDL > 0.5625: Borderline risk (7)
## : : Age > 0.07692308:
## : : :...Age <= 0.1794872: Intermediate risk (7)
## : : Age > 0.1794872:
## : : :...isDiabetic <= 0: Intermediate risk (4/1)
## : : isDiabetic > 0: High risk (2)
## : Age > 0.2820513:
## : :...Systolic > 0.7090909:
## : :...Systolic <= 0.9:
## : : :...isSmoker <= 0: Intermediate risk (12)
## : : : isSmoker > 0:
## : : : :...Age <= 0.3846154: Intermediate risk (6)
## : : : Age > 0.3846154: High risk (7/1)
## : : Systolic > 0.9:
## : : :...isDiabetic > 0: High risk (4)
## : : isDiabetic <= 0:
## : : :...Systolic <= 0.9363636: Borderline risk (7)
## : : Systolic > 0.9363636: Intermediate risk (5/1)
## : Systolic <= 0.7090909:
## : :...isDiabetic <= 0:
## : :...isMale > 0:
## : : :...Systolic > 0.6636364: Borderline risk (7)
## : : : Systolic <= 0.6636364:
## : : : :...isHypertensive > 0: Intermediate risk (18)
## : : : isHypertensive <= 0:
## : : : :...isSmoker <= 0: Low risk (5)
## : : : isSmoker > 0: Intermediate risk (2)
## : : isMale <= 0:
## : : :...Age <= 0.4871795:
## : : :...Systolic <= 0.3818182: Low risk (19)
## : : : Systolic > 0.3818182:
## : : : :...HDL <= 0.55: Borderline risk (3)
## : : : HDL > 0.55: Low risk (10/1)
## : : Age > 0.4871795:
## : : :...Cholesterol <= 0.3: Low risk (3)
## : : Cholesterol > 0.3:
## : : :...Systolic <= 0.3636364: Borderline risk (13)
## : : Systolic > 0.3636364:
## : : :...Cholesterol <= 0.4142857: Borderline risk (2)
## : : Cholesterol > 0.4142857: Intermediate risk (2)
## : isDiabetic > 0:
## : :...Age > 0.4615385:
## : :...Cholesterol <= 0.3285714: Borderline risk (2)
## : : Cholesterol > 0.3285714: Intermediate risk (19/1)
## : Age <= 0.4615385:
## : :...isSmoker > 0:
## : :...Cholesterol <= 0.5571429:
## : : :...isHypertensive <= 0: Borderline risk (14/2)
## : : : isHypertensive > 0: High risk (2)
## : : Cholesterol > 0.5571429:
## : : :...isBlack <= 0: Intermediate risk (3)
## : : isBlack > 0: High risk (3/1)
## : isSmoker <= 0:
## : :...isMale <= 0:
## : :...isHypertensive <= 0:
## : : :...HDL <= 0.675: Borderline risk (8/1)
## : : : HDL > 0.675: Low risk (4)
## : : isHypertensive > 0:
## : : :...Systolic <= 0.2909091: Low risk (2)
## : : Systolic > 0.2909091: Intermediate risk (2)
## : isMale > 0:
## : :...Systolic <= 0.07272727: Borderline risk (12)
## : Systolic > 0.07272727: [S1]
## Age > 0.5641026:
## :...Systolic > 0.5: High risk (128/10)
## Systolic <= 0.5:
## :...isDiabetic > 0:
## :...isSmoker > 0:
## : :...isHypertensive > 0: High risk (22)
## : : isHypertensive <= 0:
## : : :...Age <= 0.7948718: Intermediate risk (5/1)
## : : Age > 0.7948718: High risk (4)
## : isSmoker <= 0:
## : :...Systolic > 0.3181818: High risk (9/1)
## : Systolic <= 0.3181818:
## : :...Age > 0.9230769: High risk (4)
## : Age <= 0.9230769:
## : :...isHypertensive <= 0: Intermediate risk (8)
## : isHypertensive > 0:
## : :...Age <= 0.8205128: Intermediate risk (11/1)
## : Age > 0.8205128: High risk (2)
## isDiabetic <= 0:
## :...HDL <= 0.15:
## :...Systolic > 0.1909091: High risk (9)
## : Systolic <= 0.1909091:
## : :...isMale <= 0: Low risk (2)
## : isMale > 0: Intermediate risk (2)
## HDL > 0.15:
## :...Systolic > 0.4272727:
## :...Age <= 0.7692308: Borderline risk (5)
## : Age > 0.7692308: High risk (5)
## Systolic <= 0.4272727:
## :...Cholesterol > 0.7142857:
## :...Age <= 0.8974359: Intermediate risk (12)
## : Age > 0.8974359:
## : :...Systolic <= 0.2090909: Intermediate risk (3/1)
## : Systolic > 0.2090909: High risk (3)
## Cholesterol <= 0.7142857:
## :...Systolic > 0.2909091: Intermediate risk (15/1)
## Systolic <= 0.2909091:
## :...isHypertensive > 0:
## :...Systolic > 0.1727273: Borderline risk (7/1)
## : Systolic <= 0.1727273:
## : :...Age <= 0.6923077: Low risk (3)
## : Age > 0.6923077: Intermediate risk (3)
## isHypertensive <= 0:
## :...HDL <= 0.6: Intermediate risk (7)
## HDL > 0.6:
## :...Cholesterol <= 0.3714286: Intermediate risk (3)
## Cholesterol > 0.3714286:
## :...Systolic <= 0.05454545: Intermediate risk (2)
## Systolic > 0.05454545: Borderline risk (18)
##
## SubTree [S1]
##
## isHypertensive <= 0: Intermediate risk (5)
## isHypertensive > 0: Borderline risk (4)
##
##
## Evaluation on training data (959 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 105 55( 5.7%) <<
##
##
## (a) (b) (c) (d) <-classified as
## ---- ---- ---- ----
## 239 7 (a): class Low risk
## 1 217 3 (b): class Borderline risk
## 4 8 220 18 (c): class Intermediate risk
## 1 2 11 228 (d): class High risk
##
##
## Attribute usage:
##
## 100.00% Age
## 100.00% Systolic
## 79.87% HDL
## 49.01% isDiabetic
## 47.55% Cholesterol
## 34.62% isBlack
## 34.62% isHypertensive
## 31.39% isSmoker
## 26.07% isMale
##
##
## Time: 0.0 secs
Analysis: The C5 model demonstrates strong predictive capabilities with an accuracy of 78.37%. It effectively identifies instances of low risk (sensitivity of 80.6%) and maintains high specificity (77.6%) in recognizing non-low-risk instances. The precision of 72.26% highlights the accuracy of positive predictions. The model’s tree structure, comprising 120 nodes, reflects its complexity in capturing patterns within the data. These results suggest a well-balanced model with the potential for reliable classification across multiple risk categories.
Age plays a critical role as it is the root in risk determination. For individuals with an age value of 0.5641026 or less, the risk varies based on other factors.
HDL cholesterol and systolic blood pressure are the next important attributes, with lower HDL and higher systolic values generally increasing the risk classification.
Cholesterol levels are used to further stratify risk, especially when combined with smoking status and systolic blood pressure measurements.
For individuals who are hypertensive or diabetic, the risk of being classified as ‘High’ increases, particularly if they are also male and non-black.
2-partition the data into ( 70% training, 30% testing):
set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.70 , 0.30))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
# install.packages("C50")
library(C50)
# Define the formula
myFormula <- Risk ~ .
# Build the C5.0 decision tree on the training data with information gain
c50_model <- C5.0(myFormula, data = trainData)
# Display a summary of the decision tree
print(c50_model)
##
## Call:
## C5.0.formula(formula = myFormula, data = trainData)
##
## Classification Tree
## Number of samples: 1132
## Number of predictors: 9
##
## Tree size: 135
##
## Non-standard options: attempt to group attributes
# Make predictions using the C5.0 model on the test data
testPred <- predict(c50_model, newdata = testData)
# Calculate performance metrics
accuracy_I2 <- sum((diag(conf_matrix)) / sum(conf_matrix))
error_rate_I2 <-( 1 - accuracy_I2)
sensitivity_I2 <- conf_matrix[4, 4] / sum(conf_matrix[4, ])
specificity_I2 <- sum(diag(conf_matrix[-4, -4])) / sum(conf_matrix[-4, ])
precision_I2 <- conf_matrix[4, 4] / sum(conf_matrix[, 4])
# Display performance metrics
cat("Accuracy: ", accuracy_I2, "\n")
## Accuracy: 0.7753165
cat("Error Rate: ", error_rate_I2, "\n")
## Error Rate: 0.2246835
cat("Sensitivity (Recall): ", sensitivity_I2, "\n")
## Sensitivity (Recall): 0.7638889
cat("Specificity: ", specificity_I2, "\n")
## Specificity: 0.7786885
cat("Precision: ", precision_I2, "\n")
## Precision: 0.7236842
# Display a summary of the decision tree
summary(c50_model)
##
## Call:
## C5.0.formula(formula = myFormula, data = trainData)
##
##
## C5.0 [Release 2.07 GPL Edition] Sat Dec 2 16:40:38 2023
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 1132 cases (10 attributes) from undefined.data
##
## Decision tree:
##
## Age > 0.5641026:
## :...Systolic > 0.5:
## : :...Age > 0.5897436: High risk (141/7)
## : : Age <= 0.5897436:
## : : :...isDiabetic <= 0: Borderline risk (4/1)
## : : isDiabetic > 0: High risk (7/1)
## : Systolic <= 0.5:
## : :...isDiabetic > 0:
## : :...isSmoker > 0:
## : : :...isHypertensive > 0: High risk (28)
## : : : isHypertensive <= 0:
## : : : :...isBlack > 0: High risk (4)
## : : : isBlack <= 0:
## : : : :...Age <= 0.7948718: Intermediate risk (4)
## : : : Age > 0.7948718: High risk (2)
## : : isSmoker <= 0:
## : : :...Systolic > 0.3181818: High risk (10/1)
## : : Systolic <= 0.3181818:
## : : :...Age <= 0.8205128: Intermediate risk (18/1)
## : : Age > 0.8205128:
## : : :...isHypertensive > 0: High risk (3)
## : : isHypertensive <= 0:
## : : :...Age <= 0.948718: Intermediate risk (4)
## : : Age > 0.948718: High risk (4/1)
## : isDiabetic <= 0:
## : :...HDL <= 0.15:
## : :...Systolic > 0.1909091: High risk (9)
## : : Systolic <= 0.1909091:
## : : :...isMale <= 0: Low risk (2)
## : : isMale > 0: Intermediate risk (2)
## : HDL > 0.15:
## : :...Age <= 0.6923077:
## : :...isSmoker > 0:
## : : :...Age <= 0.5897436: Borderline risk (3)
## : : : Age > 0.5897436: Intermediate risk (9/1)
## : : isSmoker <= 0:
## : : :...Systolic <= 0.1727273: Low risk (4/1)
## : : Systolic > 0.1727273:
## : : :...Age <= 0.5897436: Intermediate risk (3/1)
## : : Age > 0.5897436: Borderline risk (22)
## : Age > 0.6923077:
## : :...HDL > 0.975: Borderline risk (5)
## : HDL <= 0.975:
## : :...Systolic > 0.4272727: High risk (5)
## : Systolic <= 0.4272727:
## : :...Cholesterol <= 0.05714286:
## : :...Cholesterol <= 0.02857143: Intermediate risk (5)
## : : Cholesterol > 0.02857143: Borderline risk (4)
## : Cholesterol > 0.05714286:
## : :...isSmoker <= 0:
## : :...Age <= 0.9230769: Intermediate risk (19)
## : : Age > 0.9230769:
## : : :...isMale <= 0: Intermediate risk (4)
## : : isMale > 0: High risk (2)
## : isSmoker > 0:
## : :...Age <= 0.7692308: Intermediate risk (4)
## : Age > 0.7692308:
## : :...Systolic <= 0.07272727: Intermediate risk (2)
## : Systolic > 0.07272727: High risk (4)
## Age <= 0.5641026:
## :...HDL <= 0.225:
## :...Systolic > 0.5545455:
## : :...Cholesterol <= 0.01428571: Borderline risk (5/1)
## : : Cholesterol > 0.01428571:
## : : :...isSmoker <= 0:
## : : :...isDiabetic <= 0: Intermediate risk (12/1)
## : : : isDiabetic > 0:
## : : : :...isMale <= 0: Intermediate risk (4/1)
## : : : isMale > 0: High risk (5)
## : : isSmoker > 0:
## : : :...isDiabetic <= 0:
## : : :...Cholesterol <= 0.2428571: Intermediate risk (2)
## : : : Cholesterol > 0.2428571: High risk (13/2)
## : : isDiabetic > 0:
## : : :...HDL <= 0.2: High risk (17)
## : : HDL > 0.2: Intermediate risk (3/1)
## : Systolic <= 0.5545455:
## : :...isDiabetic <= 0:
## : :...isSmoker > 0:
## : : :...isBlack > 0: Intermediate risk (13/1)
## : : : isBlack <= 0:
## : : : :...Systolic <= 0.3090909: Borderline risk (9/1)
## : : : Systolic > 0.3090909: Intermediate risk (2)
## : : isSmoker <= 0:
## : : :...Age <= 0.2564103: Low risk (13)
## : : Age > 0.2564103:
## : : :...isHypertensive <= 0:
## : : :...Cholesterol <= 0.2571429: Intermediate risk (2)
## : : : Cholesterol > 0.2571429: Borderline risk (12/1)
## : : isHypertensive > 0:
## : : :...Systolic <= 0.08181818: Low risk (3)
## : : Systolic > 0.08181818:
## : : :...Age <= 0.5384616: Intermediate risk (5)
## : : Age > 0.5384616: Borderline risk (2)
## : isDiabetic > 0:
## : :...Age > 0.4102564:
## : :...Cholesterol <= 0.2714286: Low risk (3/1)
## : : Cholesterol > 0.2714286: Intermediate risk (12/1)
## : Age <= 0.4102564:
## : :...Cholesterol <= 0.3571429: Intermediate risk (8/1)
## : Cholesterol > 0.3571429:
## : :...isHypertensive <= 0: Borderline risk (17/1)
## : isHypertensive > 0:
## : :...Cholesterol > 0.6857143: Intermediate risk (4)
## : Cholesterol <= 0.6857143:
## : :...isSmoker <= 0: Borderline risk (3)
## : isSmoker > 0: High risk (2)
## HDL > 0.225:
## :...Age > 0.2820513:
## :...Systolic <= 0.2545455:
## : :...Age > 0.5384616: Intermediate risk (5)
## : : Age <= 0.5384616:
## : : :...isDiabetic <= 0:
## : : :...isHypertensive <= 0: Low risk (20)
## : : : isHypertensive > 0:
## : : : :...Cholesterol <= 0.4:
## : : : :...isMale <= 0: Low risk (5)
## : : : : isMale > 0: Intermediate risk (2)
## : : : Cholesterol > 0.4:
## : : : :...isSmoker <= 0: Low risk (2)
## : : : isSmoker > 0: Borderline risk (14)
## : : isDiabetic > 0:
## : : :...Age > 0.4358974: Intermediate risk (8)
## : : Age <= 0.4358974:
## : : :...isHypertensive > 0: Low risk (3)
## : : isHypertensive <= 0:
## : : :...Systolic <= 0.2: Borderline risk (21/1)
## : : Systolic > 0.2: Low risk (3/1)
## : Systolic > 0.2545455:
## : :...isMale > 0:
## : :...Cholesterol > 0.9142857:
## : : :...isHypertensive <= 0: Intermediate risk (2)
## : : : isHypertensive > 0: Borderline risk (7)
## : : Cholesterol <= 0.9142857:
## : : :...isSmoker <= 0:
## : : :...isDiabetic <= 0: Intermediate risk (18)
## : : : isDiabetic > 0:
## : : : :...isHypertensive <= 0: Intermediate risk (6/1)
## : : : isHypertensive > 0:
## : : : :...Age <= 0.4358974: Borderline risk (4)
## : : : Age > 0.4358974: Intermediate risk (2)
## : : isSmoker > 0:
## : : :...isDiabetic > 0:
## : : :...Cholesterol <= 0.1285714: Intermediate risk (4)
## : : : Cholesterol > 0.1285714: High risk (11/1)
## : : isDiabetic <= 0:
## : : :...Systolic <= 0.6909091: Intermediate risk (11)
## : : Systolic > 0.6909091: [S1]
## : isMale <= 0:
## : :...isDiabetic > 0:
## : :...isHypertensive <= 0:
## : : :...Systolic > 0.6090909:
## : : : :...isBlack <= 0: Intermediate risk (6/1)
## : : : : isBlack > 0: High risk (2)
## : : : Systolic <= 0.6090909:
## : : : :...isBlack <= 0:
## : : : :...Age <= 0.3333333: Low risk (2)
## : : : : Age > 0.3333333: Borderline risk (13)
## : : : isBlack > 0:
## : : : :...isSmoker <= 0: Borderline risk (4)
## : : : isSmoker > 0: Intermediate risk (3)
## : : isHypertensive > 0:
## : : :...Systolic > 0.8272727: High risk (3)
## : : Systolic <= 0.8272727:
## : : :...isBlack <= 0: Intermediate risk (9)
## : : isBlack > 0:
## : : :...Cholesterol <= 0.8142857: Intermediate risk (7)
## : : Cholesterol > 0.8142857: High risk (2)
## : isDiabetic <= 0:
## : :...Age <= 0.3846154:
## : :...HDL <= 0.5125: Intermediate risk (2)
## : : HDL > 0.5125: Low risk (12)
## : Age > 0.3846154:
## : :...Cholesterol > 0.8142857: Low risk (3/1)
## : Cholesterol <= 0.8142857:
## : :...Systolic > 0.9363636: Intermediate risk (2)
## : Systolic <= 0.9363636:
## : :...Cholesterol > 0.4142857:
## : :...Age <= 0.5128205: Borderline risk (5)
## : : Age > 0.5128205: Intermediate risk (3)
## : Cholesterol <= 0.4142857:
## : :...HDL <= 0.625: Borderline risk (10)
## : HDL > 0.625: [S2]
## Age <= 0.2820513:
## :...Systolic <= 0.1636364:
## :...isBlack <= 0: Low risk (44)
## : isBlack > 0:
## : :...isMale <= 0: Low risk (9)
## : isMale > 0:
## : :...Systolic <= 0.09090909: Low risk (6)
## : Systolic > 0.09090909: Intermediate risk (4)
## Systolic > 0.1636364:
## :...isBlack > 0:
## :...isDiabetic > 0:
## : :...Systolic <= 0.3090909:
## : : :...HDL > 0.45: Low risk (6)
## : : : HDL <= 0.45:
## : : : :...Age <= 0.2051282: Intermediate risk (2)
## : : : Age > 0.2051282: Borderline risk (3)
## : : Systolic > 0.3090909:
## : : :...isHypertensive <= 0:
## : : :...Cholesterol <= 0.3142857: Borderline risk (5/1)
## : : : Cholesterol > 0.3142857: Intermediate risk (10/2)
## : : isHypertensive > 0:
## : : :...Age > 0.1538462: High risk (6)
## : : Age <= 0.1538462:
## : : :...Systolic <= 0.8818182: Intermediate risk (9/1)
## : : Systolic > 0.8818182: High risk (2)
## : isDiabetic <= 0:
## : :...Systolic <= 0.5545455:
## : :...Systolic <= 0.2454545:
## : : :...isMale <= 0: Low risk (2)
## : : : isMale > 0: Borderline risk (15/1)
## : : Systolic > 0.2454545:
## : : :...isHypertensive <= 0: Low risk (20/2)
## : : isHypertensive > 0:
## : : :...HDL <= 0.4625: Borderline risk (5/1)
## : : HDL > 0.4625: Low risk (5)
## : Systolic > 0.5545455:
## : :...isSmoker <= 0:
## : :...isMale <= 0:
## : : :...Age <= 0.1538462: Borderline risk (5/1)
## : : : Age > 0.1538462: Low risk (6)
## : : isMale > 0:
## : : :...Systolic <= 0.7181818: Borderline risk (9)
## : : Systolic > 0.7181818:
## : : :...Systolic <= 0.8727273: Intermediate risk (2)
## : : Systolic > 0.8727273: Borderline risk (4)
## : isSmoker > 0:
## : :...Cholesterol <= 0.07142857: Low risk (2)
## : Cholesterol > 0.07142857:
## : :...Age <= 0: Borderline risk (5/1)
## : Age > 0: [S3]
## isBlack <= 0:
## :...Cholesterol <= 0.2428571: Low risk (38/1)
## Cholesterol > 0.2428571:
## :...HDL > 0.8125:
## :...isMale <= 0: Low risk (17)
## : isMale > 0:
## : :...Age <= 0.07692308: Low risk (3)
## : Age > 0.07692308: Intermediate risk (2)
## HDL <= 0.8125:
## :...isSmoker <= 0:
## :...Age <= 0.2307692: Low risk (31)
## : Age > 0.2307692:
## : :...isMale <= 0: Low risk (2)
## : isMale > 0: Borderline risk (12)
## isSmoker > 0:
## :...Systolic <= 0.3090909:
## :...Cholesterol <= 0.6571429: Low risk (9)
## : Cholesterol > 0.6571429: Borderline risk (3)
## Systolic > 0.3090909:
## :...isMale <= 0: [S4]
## isMale > 0:
## :...Systolic > 0.9: Intermediate risk (2)
## Systolic <= 0.9:
## :...HDL > 0.4625: Borderline risk (23)
## HDL <= 0.4625: [S5]
##
## SubTree [S1]
##
## isHypertensive <= 0: Intermediate risk (3)
## isHypertensive > 0: High risk (4)
##
## SubTree [S2]
##
## isSmoker <= 0: Low risk (2)
## isSmoker > 0: Borderline risk (5)
##
## SubTree [S3]
##
## Cholesterol <= 0.8714285: Intermediate risk (12)
## Cholesterol > 0.8714285: High risk (3/1)
##
## SubTree [S4]
##
## isHypertensive <= 0: Borderline risk (17/1)
## isHypertensive > 0: Low risk (4)
##
## SubTree [S5]
##
## isDiabetic <= 0: Borderline risk (8/1)
## isDiabetic > 0: Intermediate risk (2)
##
##
## Evaluation on training data (1132 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 135 48( 4.2%) <<
##
##
## (a) (b) (c) (d) <-classified as
## ---- ---- ---- ----
## 274 3 4 (a): class Low risk
## 1 270 (b): class Borderline risk
## 5 6 265 14 (c): class Intermediate risk
## 1 4 10 275 (d): class High risk
##
##
## Attribute usage:
##
## 100.00% Age
## 100.00% Systolic
## 79.77% HDL
## 65.90% isDiabetic
## 46.73% isSmoker
## 45.23% Cholesterol
## 40.28% isBlack
## 30.65% isMale
## 29.24% isHypertensive
##
##
## Time: 0.0 secs
Analysis: The C5 model achieved an accuracy of 78.07%, demonstrating its proficiency in making correct predictions across all classes. It exhibits robust sensitivity (77.23%), effectively identifying instances of high risk. The model’s specificity (78.31%) suggests improved accuracy in correctly identifying non-high-risk instances compared to the previous configuration. The precision of 72.90% reflects the accuracy of positive predictions. The tree structure, comprising 125 nodes, signifies a moderate level of complexity. Overall, the model performs well, with enhanced specificity, showcasing its suitability for this classification task.
Age is the primary split, indicating its importance as a predictor. For individuals older than 0.5641026 , the risk generally increases.
Systolic blood pressure is another crucial factor, with higher values leading to a ‘High risk’ classification, particularly in the presence of diabetes.
Diabetes status (isDiabetic) is a significant differentiator for risk levels. Diabetic individuals tend to be classified as ‘High risk’ more frequently, especially when combined with other risk factors like smoking or higher age.
3-partition the data into ( 80% training, 20% testing):sting):
set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.80 , 0.20))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
# install.packages("C50")
library(C50)
# Define the formula
myFormula <- Risk ~ .
# Build the C5.0 decision tree on the training data with information gain
c50_model <- C5.0(myFormula, data = trainData)
# Display a summary of the decision tree
print(c50_model)
##
## Call:
## C5.0.formula(formula = myFormula, data = trainData)
##
## Classification Tree
## Number of samples: 1272
## Number of predictors: 9
##
## Tree size: 155
##
## Non-standard options: attempt to group attributes
# Make predictions using the C5.0 model on the test data
testPred <- predict(c50_model, newdata = testData)
# Display a summary of the decision tree
summary(c50_model)
##
## Call:
## C5.0.formula(formula = myFormula, data = trainData)
##
##
## C5.0 [Release 2.07 GPL Edition] Sat Dec 2 16:40:39 2023
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 1272 cases (10 attributes) from undefined.data
##
## Decision tree:
##
## Age > 0.5641026:
## :...Systolic > 0.4909091:
## : :...Age <= 0.5897436:
## : : :...isDiabetic <= 0: Borderline risk (4/1)
## : : : isDiabetic > 0: High risk (7/1)
## : : Age > 0.5897436:
## : : :...isSmoker > 0: High risk (86)
## : : isSmoker <= 0:
## : : :...isMale > 0: High risk (37/1)
## : : isMale <= 0:
## : : :...Age <= 0.6666667: Intermediate risk (5/1)
## : : Age > 0.6666667:
## : : :...isDiabetic > 0: High risk (16)
## : : isDiabetic <= 0:
## : : :...Systolic <= 0.8181818: Intermediate risk (5/1)
## : : Systolic > 0.8181818: High risk (7)
## : Systolic <= 0.4909091:
## : :...isDiabetic > 0:
## : :...isSmoker > 0:
## : : :...isHypertensive > 0: High risk (32)
## : : : isHypertensive <= 0:
## : : : :...isBlack > 0: High risk (6)
## : : : isBlack <= 0:
## : : : :...Age <= 0.6923077: Intermediate risk (5)
## : : : Age > 0.6923077: High risk (4)
## : : isSmoker <= 0:
## : : :...Systolic > 0.3181818:
## : : :...Cholesterol <= 0.2: Intermediate risk (2)
## : : : Cholesterol > 0.2: High risk (10)
## : : Systolic <= 0.3181818:
## : : :...Age <= 0.8205128: Intermediate risk (22/1)
## : : Age > 0.8205128:
## : : :...isHypertensive > 0: High risk (4)
## : : isHypertensive <= 0:
## : : :...isBlack <= 0: High risk (2)
## : : isBlack > 0:
## : : :...Systolic <= 0.04545455: High risk (2)
## : : Systolic > 0.04545455: Intermediate risk (5)
## : isDiabetic <= 0:
## : :...Age <= 0.6923077:
## : :...HDL <= 0.1125:
## : : :...isHypertensive <= 0: Low risk (2)
## : : : isHypertensive > 0: High risk (3)
## : : HDL > 0.1125:
## : : :...isSmoker > 0:
## : : :...Systolic <= 0.08181818: Borderline risk (4)
## : : : Systolic > 0.08181818: Intermediate risk (11/1)
## : : isSmoker <= 0:
## : : :...Systolic <= 0.1818182: Low risk (6/1)
## : : Systolic > 0.1818182:
## : : :...Age <= 0.5897436: Intermediate risk (3/1)
## : : Age > 0.5897436: Borderline risk (27)
## : Age > 0.6923077:
## : :...HDL > 0.975: Borderline risk (5)
## : HDL <= 0.975:
## : :...Age <= 0.7692308:
## : :...Cholesterol <= 0.05714286: Borderline risk (5/1)
## : : Cholesterol > 0.05714286: Intermediate risk (14)
## : Age > 0.7692308:
## : :...isSmoker > 0:
## : :...Systolic > 0.2272727: High risk (10)
## : : Systolic <= 0.2272727:
## : : :...isBlack <= 0: Intermediate risk (3)
## : : isBlack > 0: High risk (3/1)
## : isSmoker <= 0:
## : :...Systolic > 0.4272727: High risk (3)
## : Systolic <= 0.4272727:
## : :...Age <= 0.9230769: Intermediate risk (17)
## : Age > 0.9230769:
## : :...isBlack <= 0: High risk (2)
## : isBlack > 0: Intermediate risk (6/1)
## Age <= 0.5641026:
## :...Age <= 0.3333333:
## :...HDL <= 0.25:
## : :...Systolic > 0.5454546:
## : : :...Systolic <= 0.6181818:
## : : : :...isDiabetic > 0: High risk (6/1)
## : : : : isDiabetic <= 0:
## : : : : :...isHypertensive <= 0: Borderline risk (10)
## : : : : isHypertensive > 0: Intermediate risk (2/1)
## : : : Systolic > 0.6181818:
## : : : :...isSmoker <= 0:
## : : : :...Age <= 0.02564103: Low risk (2)
## : : : : Age > 0.02564103:
## : : : : :...isBlack <= 0: Intermediate risk (5)
## : : : : isBlack > 0: High risk (4/1)
## : : : isSmoker > 0:
## : : : :...isBlack > 0: High risk (14/1)
## : : : isBlack <= 0:
## : : : :...isMale > 0: High risk (4)
## : : : isMale <= 0:
## : : : :...Cholesterol <= 0.4857143: Intermediate risk (3)
## : : : Cholesterol > 0.4857143: High risk (2)
## : : Systolic <= 0.5454546:
## : : :...isSmoker <= 0:
## : : :...isDiabetic <= 0:
## : : : :...Systolic <= 0.3090909: Low risk (14)
## : : : : Systolic > 0.3090909:
## : : : : :...Age <= 0.07692308: Low risk (5)
## : : : : Age > 0.07692308: Borderline risk (11)
## : : : isDiabetic > 0:
## : : : :...isMale <= 0: Intermediate risk (3)
## : : : isMale > 0:
## : : : :...Cholesterol <= 0.4571429: Low risk (3/1)
## : : : Cholesterol > 0.4571429:
## : : : :...isHypertensive <= 0: Borderline risk (5)
## : : : isHypertensive > 0:
## : : : :...isBlack <= 0: Borderline risk (4)
## : : : isBlack > 0: Intermediate risk (2)
## : : isSmoker > 0:
## : : :...Systolic > 0.3272727:
## : : :...Systolic <= 0.3727273: Low risk (3/1)
## : : : Systolic > 0.3727273: Intermediate risk (9)
## : : Systolic <= 0.3272727:
## : : :...Age <= 0: Low risk (3/1)
## : : Age > 0:
## : : :...Age > 0.2051282: Intermediate risk (3/1)
## : : Age <= 0.2051282:
## : : :...HDL <= 0.0125: Intermediate risk (2)
## : : HDL > 0.0125:
## : : :...isDiabetic <= 0: Borderline risk (12)
## : : isDiabetic > 0:
## : : :...Systolic <= 0.1: Borderline risk (6)
## : : Systolic > 0.1: Intermediate risk (2)
## : HDL > 0.25:
## : :...isBlack <= 0:
## : :...isSmoker <= 0:
## : : :...Age <= 0.2307692: Low risk (82)
## : : : Age > 0.2307692:
## : : : :...isDiabetic <= 0: Low risk (15)
## : : : isDiabetic > 0:
## : : : :...isMale <= 0: Low risk (6)
## : : : isMale > 0:
## : : : :...Systolic <= 0.1454545: Low risk (2)
## : : : Systolic > 0.1454545: Borderline risk (14)
## : : isSmoker > 0:
## : : :...HDL > 0.8125:
## : : :...Cholesterol <= 0.7428572: Low risk (29)
## : : : Cholesterol > 0.7428572:
## : : : :...isMale <= 0: Low risk (3)
## : : : isMale > 0: Intermediate risk (2)
## : : HDL <= 0.8125:
## : : :...Systolic <= 0.3090909:
## : : :...Age <= 0.1794872: Low risk (20)
## : : : Age > 0.1794872:
## : : : :...isDiabetic <= 0: Low risk (2)
## : : : isDiabetic > 0: Borderline risk (6)
## : : Systolic > 0.3090909:
## : : :...Cholesterol <= 0.2285714:
## : : :...isMale <= 0: Low risk (6)
## : : : isMale > 0: Intermediate risk (3/1)
## : : Cholesterol > 0.2285714:
## : : :...isMale <= 0:
## : : :...isHypertensive <= 0: Borderline risk (20/1)
## : : : isHypertensive > 0: Low risk (5)
## : : isMale > 0:
## : : :...HDL > 0.55: Borderline risk (27)
## : : HDL <= 0.55: [S1]
## : isBlack > 0:
## : :...Systolic > 0.5363637:
## : :...Age <= 0.1025641:
## : : :...HDL <= 0.625:
## : : : :...Systolic <= 0.8727273: Intermediate risk (9/1)
## : : : : Systolic > 0.8727273: High risk (5/1)
## : : : HDL > 0.625:
## : : : :...isDiabetic <= 0: Borderline risk (15/1)
## : : : isDiabetic > 0:
## : : : :...isSmoker <= 0: Intermediate risk (2)
## : : : isSmoker > 0: Borderline risk (5)
## : : Age > 0.1025641:
## : : :...isDiabetic <= 0:
## : : :...isHypertensive > 0: Intermediate risk (8/1)
## : : : isHypertensive <= 0:
## : : : :...isMale > 0: Intermediate risk (2)
## : : : isMale <= 0:
## : : : :...Cholesterol <= 0.6428571: Low risk (9)
## : : : Cholesterol > 0.6428571: Intermediate risk (2)
## : : isDiabetic > 0:
## : : :...isSmoker <= 0: Intermediate risk (8/1)
## : : isSmoker > 0:
## : : :...isMale > 0: High risk (6)
## : : isMale <= 0: [S2]
## : Systolic <= 0.5363637:
## : :...Cholesterol > 0.8285714:
## : :...isHypertensive <= 0: Borderline risk (13/1)
## : : isHypertensive > 0: Intermediate risk (4/1)
## : Cholesterol <= 0.8285714:
## : :...isMale <= 0:
## : :...isDiabetic <= 0:
## : : :...Cholesterol <= 0.7857143: Low risk (29)
## : : : Cholesterol > 0.7857143:
## : : : :...Age <= 0.2307692: Low risk (2)
## : : : Age > 0.2307692: Borderline risk (2)
## : : isDiabetic > 0:
## : : :...Systolic <= 0.3272727: Low risk (11)
## : : Systolic > 0.3272727:
## : : :...Age <= 0.07692308: Low risk (2)
## : : Age > 0.07692308: Intermediate risk (3)
## : isMale > 0:
## : :...Systolic <= 0.09090909: Low risk (9)
## : Systolic > 0.09090909:
## : :...isDiabetic > 0: Intermediate risk (7)
## : isDiabetic <= 0:
## : :...Age > 0.2820513: Intermediate risk (5)
## : Age <= 0.2820513:
## : :...Systolic <= 0.2545455: Borderline risk (17/1)
## : Systolic > 0.2545455: Low risk (9/1)
## Age > 0.3333333:
## :...Systolic <= 0.2545455:
## :...Cholesterol > 0.8285714:
## : :...isDiabetic > 0: Intermediate risk (9)
## : : isDiabetic <= 0:
## : : :...Age <= 0.4615385: Low risk (3)
## : : Age > 0.4615385: Intermediate risk (2)
## : Cholesterol <= 0.8285714:
## : :...HDL > 0.8375:
## : :...Age <= 0.4615385: Low risk (9)
## : : Age > 0.4615385: Intermediate risk (2)
## : HDL <= 0.8375:
## : :...isMale > 0:
## : :...isHypertensive > 0: Intermediate risk (8/1)
## : : isHypertensive <= 0:
## : : :...HDL <= 0.2125:
## : : :...Cholesterol <= 0.7857143: Intermediate risk (6)
## : : : Cholesterol > 0.7857143: Borderline risk (3)
## : : HDL > 0.2125:
## : : :...isDiabetic > 0: Borderline risk (9/1)
## : : isDiabetic <= 0:
## : : :...HDL <= 0.2375: Borderline risk (5)
## : : HDL > 0.2375: Low risk (4)
## : isMale <= 0:
## : :...Systolic <= 0.09090909:
## : :...isDiabetic <= 0: Low risk (8)
## : : isDiabetic > 0: Intermediate risk (2)
## : Systolic > 0.09090909:
## : :...Cholesterol <= 0.2285714:
## : :...Systolic <= 0.1909091: Low risk (5)
## : : Systolic > 0.1909091: Borderline risk (3)
## : Cholesterol > 0.2285714:
## : :...Systolic > 0.2181818: Low risk (3/1)
## : Systolic <= 0.2181818:
## : :...HDL > 0.475: Borderline risk (20)
## : HDL <= 0.475:
## : :...Age <= 0.4102564: Borderline risk (5/1)
## : Age > 0.4102564: Intermediate risk (2)
## Systolic > 0.2545455:
## :...HDL <= 0.2:
## :...isSmoker > 0:
## : :...Systolic <= 0.3545454: Intermediate risk (3)
## : : Systolic > 0.3545454: High risk (13/2)
## : isSmoker <= 0:
## : :...isMale <= 0: Intermediate risk (12/1)
## : isMale > 0:
## : :...isDiabetic <= 0: Intermediate risk (7/1)
## : isDiabetic > 0: High risk (4)
## HDL > 0.2:
## :...isMale > 0:
## :...Cholesterol > 0.9285714:
## : :...isHypertensive <= 0: Intermediate risk (2)
## : : isHypertensive > 0: Borderline risk (7)
## : Cholesterol <= 0.9285714:
## : :...isDiabetic <= 0: Intermediate risk (34/3)
## : isDiabetic > 0:
## : :...HDL <= 0.6875: High risk (10/1)
## : HDL > 0.6875:
## : :...isSmoker > 0:
## : :...Cholesterol <= 0.3142857: Intermediate risk (3)
## : : Cholesterol > 0.3142857: High risk (3)
## : isSmoker <= 0:
## : :...Cholesterol > 0.2571429: Intermediate risk (6)
## : Cholesterol <= 0.2571429: [S3]
## isMale <= 0:
## :...Cholesterol > 0.8142857:
## :...Age <= 0.4102564: Low risk (6/1)
## : Age > 0.4102564:
## : :...Systolic <= 0.5909091: Intermediate risk (3)
## : Systolic > 0.5909091: High risk (4)
## Cholesterol <= 0.8142857:
## :...isHypertensive > 0:
## :...isDiabetic > 0:
## : :...Systolic <= 0.8272727: Intermediate risk (12)
## : : Systolic > 0.8272727: High risk (2)
## : isDiabetic <= 0:
## : :...Cholesterol <= 0.2142857: Borderline risk (7)
## : Cholesterol > 0.2142857:
## : :...HDL <= 0.55: Intermediate risk (5)
## : HDL > 0.55: Low risk (4)
## isHypertensive <= 0:
## :...HDL > 0.95: Low risk (2)
## HDL <= 0.95:
## :...Cholesterol > 0.4142857:
## :...Systolic > 0.7090909: Intermediate risk (6)
## : Systolic <= 0.7090909:
## : :...Age <= 0.5128205: Borderline risk (9/1)
## : Age > 0.5128205: Intermediate risk (2)
## Cholesterol <= 0.4142857:
## :...isBlack <= 0: Borderline risk (22)
## isBlack > 0:
## :...Cholesterol > 0.3285714: Borderline risk (4)
## Cholesterol <= 0.3285714:
## :...Age <= 0.4358974: Intermediate risk (2)
## Age > 0.4358974: Low risk (2/1)
##
## SubTree [S1]
##
## Cholesterol <= 0.8714285: Intermediate risk (5)
## Cholesterol > 0.8714285: Borderline risk (3)
##
## SubTree [S2]
##
## isHypertensive <= 0: Intermediate risk (2)
## isHypertensive > 0: High risk (2)
##
## SubTree [S3]
##
## isHypertensive <= 0: Intermediate risk (2)
## isHypertensive > 0: Borderline risk (4)
##
##
## Evaluation on training data (1272 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 155 46( 3.6%) <<
##
##
## (a) (b) (c) (d) <-classified as
## ---- ---- ---- ----
## 317 3 2 (a): class Low risk
## 304 1 (b): class Borderline risk
## 7 6 302 9 (c): class Intermediate risk
## 1 17 303 (d): class High risk
##
##
## Attribute usage:
##
## 100.00% Age
## 89.23% Systolic
## 78.38% HDL
## 62.74% isSmoker
## 53.46% isDiabetic
## 45.60% isMale
## 43.08% Cholesterol
## 42.77% isBlack
## 22.33% isHypertensive
##
##
## Time: 0.0 secs
# Calculate performance metrics
accuracy_I3 <- sum((diag(conf_matrix)) / sum(conf_matrix))
error_rate_I3 <-( 1 - accuracy_I3)
sensitivity_I3 <- conf_matrix[4, 4] / sum(conf_matrix[4, ])
specificity_I3 <- sum(diag(conf_matrix[-4, -4])) / sum(conf_matrix[-4, ])
precision_I3 <- conf_matrix[4, 4] / sum(conf_matrix[, 4])
# Display performance metrics
cat("Accuracy: ", accuracy_I3, "\n")
## Accuracy: 0.7753165
cat("Error Rate: ", error_rate_I3, "\n")
## Error Rate: 0.2246835
cat("Sensitivity (Recall): ", sensitivity_I3, "\n")
## Sensitivity (Recall): 0.7638889
cat("Specificity: ", specificity_I3, "\n")
## Specificity: 0.7786885
cat("Precision: ", precision_I3, "\n")
## Precision: 0.7236842
Analysis: The C5 model achieved an accuracy of 64.42%, showcasing its ability to make correct predictions across all classes. It exhibits strong sensitivity (73.03%), effectively identifying instances of high risk. However, the model’s specificity (57.98%) suggests potential for improvement in correctly identifying non-high-risk instances. The precision of 84.42% reflects the accuracy of positive predictions. The tree structure, comprising 92 nodes, indicates a moderate level of complexity. While the model performs reasonably well, there may be opportunities for refinement, particularly in specificity. Overall,The model’s strength lies in identifying clear cases (Low and High risk) .
the root of this tree is the age,
For individuals older than 0.5641026:
Higher systolic blood pressure (greater than 0.4909091) generally indicates higher risk, with smoking and being male increasing the likelihood of being at ‘High risk’.
Diabetic individuals within this age and systolic blood pressure range are also more likely to be at ‘High risk’.
For individuals younger than or equal to 0.5641026:
# Create data frames for each model's summary
summary1 <- data.frame(
Model = "60%training 40%testing",
Accuracy = 78.37,
Sensitivity = 80.6,
Specificity = 77.6,
Precision = 72.26
)
summary2 <- data.frame(
Model = "70%training 30%testing",
Accuracy = 78.07,
Sensitivity = 77.23,
Specificity = 78.31,
Precision = 72.90
)
summary3 <- data.frame(
Model = "80%training 20%testing",
Accuracy = 82.8,
Sensitivity = 73.03,
Specificity = 57.98,
Precision = 84.42
)
# Combine the summaries into a single data frame
comparison_table <- rbind(summary1, summary2, summary3)
# Print the comparison table
print(comparison_table)
## Model Accuracy Sensitivity Specificity Precision
## 1 60%training 40%testing 78.37 80.60 77.60 72.26
## 2 70%training 30%testing 78.07 77.23 78.31 72.90
## 3 80%training 20%testing 82.80 73.03 57.98 84.42
Analysis:
Conclusion:
Opting for RPART with the Gini index involves building a decision tree that maximizes class separation by minimizing impurity. This method, rooted in recursive partitioning, aims to create nodes that group similar instances based on the Gini impurity criterion.
1-partition the data into ( 60% training, 40% testing):
set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.60 , 0.40))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
#train using the trainData and create the rpart gini index tree
library('rpart')
library('rpart.plot')
library(caret)
tree <- rpart(myFormula, data = trainData,method = 'class')
rpart.plot(tree)
# Make predictions using the RPART model on the test data
test_pred <- predict(tree, newdata = testData, type = "class")
# Create a confusion matrix
conf_matrix_rpart <- table(test_pred, testData$Risk)
# Display the confusion matrix
print(conf_matrix_rpart)
##
## test_pred Low risk Borderline risk Intermediate risk High risk
## Low risk 107 48 16 4
## Borderline risk 31 71 28 4
## Intermediate risk 11 53 65 36
## High risk 2 4 38 111
# Calculate performance metrics
accuracy_D1 <- sum(diag(conf_matrix)) / sum(conf_matrix)
error_rate_D1 <- 1 - accuracy_D1
sensitivity_D1 <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
specificity_D1 <- sum(diag(conf_matrix[-2, -2])) / sum(conf_matrix[-2, ])
precision_D1 <- conf_matrix[2, 2] / sum(conf_matrix[, 2])
# Display performance metrics
cat("Accuracy: ", accuracy_D1, "\n")
## Accuracy: 0.7753165
cat("Error Rate: ", error_rate_D1, "\n")
## Error Rate: 0.2246835
cat("Sensitivity (Recall): ", sensitivity_D1, "\n")
## Sensitivity (Recall): 0.8490566
cat("Specificity: ", specificity_D1, "\n")
## Specificity: 0.7380952
cat("Precision: ", precision_D1, "\n")
## Precision: 0.9782609
Analysis:
The results obtained from the rpart model showcase a balanced performance across various risk categories. The model achieved an overall accuracy of 57.39%, indicating its ability to make correct predictions across all classes. Sensitivity, measuring the model’s capability to identify positive instances, is at 50.81%, demonstrating a reasonable ability to detect true positives. Specificity stands at 60.14%, indicating the model’s proficiency in correctly identifying negative instances. The precision of 53.41% signifies the accuracy of positive predictions.
Root Node: The root node of the tree is based on the age attribute, indicating that age is a primary factor in assessing risk. The threshold value for the split is 0.58; individuals above this threshold are classified into different risk categories primarily based on their systolic blood pressure and diabetes status.
Age-Based Stratification: There’s a clear
stratification by age with two main branches: one for individuals with
Age <= 0.58 and another for
Age > 0.58. This suggests that age is a
significant determinant of risk level in this model.
Systolic Blood Pressure: Within the older age
group (Age > 0.58), systolic blood
pressure is the next significant factor. Those with a systolic pressure
above 0.42 are considered ‘High risk’,
2-partition the data into ( 70% training, 30% testing):
set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.70 , 0.30))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
#train using the trainData and create the rpart gini index tree
library('rpart')
library('rpart.plot')
tree <- rpart(myFormula, data = trainData,method = 'class')
rpart.plot(tree)
# Make predictions using the RPART model on the test data
test_pred <- predict(tree, newdata = testData, type = "class")
# Create a confusion matrix
conf_matrix <- table(test_pred, testData$Risk)
# Display the confusion matrix
print(conf_matrix)
##
## test_pred Low risk Borderline risk Intermediate risk High risk
## Low risk 75 12 15 3
## Borderline risk 34 81 21 1
## Intermediate risk 5 31 54 37
## High risk 2 2 17 66
# Calculate performance metrics
accuracy_D2 <- sum(diag(conf_matrix)) / sum(conf_matrix)
error_rate_D2 <- 1 - accuracy_D2
sensitivity_D2 <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
specificity_D2 <- sum(diag(conf_matrix[-2, -2])) / sum(conf_matrix[-2, ])
precision_D2 <- conf_matrix[2, 2] / sum(conf_matrix[, 2])
# Display performance metrics
cat("Accuracy: ", accuracy_D2, "\n")
## Accuracy: 0.6052632
cat("Error Rate: ", error_rate_D2, "\n")
## Error Rate: 0.3947368
cat("Sensitivity (Recall): ", sensitivity_D2, "\n")
## Sensitivity (Recall): 0.5912409
cat("Specificity: ", specificity_D2, "\n")
## Specificity: 0.6112853
cat("Precision: ", precision_D2, "\n")
## Precision: 0.6428571
Analysis:
The results from the RPART model reveal a well-balanced performance across different risk categories. The model achieved an overall accuracy of 60.31%, indicating its proficiency in making accurate predictions across all classes. Notably, it demonstrated a sensitivity of 55.10%, effectively identifying instances of low risk, and a specificity of 62.78%, accurately recognizing non-low-risk instances. The precision of 64.29% underscores the model’s accuracy in positive predictions.
Age as a Primary Factor: The tree splits initially on age, with the first division at 0.58. This suggests that age is a significant determinant in assessing risk levels in this model.
Systolic Blood Pressure: Among older individuals (age > 0.58), systolic blood pressure is a critical factor for risk classification. Higher systolic pressure tends to lead to a higher risk assessment.
Diabetes and Smoking Status: At higher systolic
levels, being diabetic or a smoker substantially increases the risk,
often resulting in a ‘High risk’ classification. For instance,
individuals with Age > 0.58 and
Systolic > 0.42 who are diabetic or
smokers are mostly classified as ‘High risk’.
3-partition the data into ( 80% training, 20% testing):
set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.80 , 0.20))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
#train using the trainData and create the rpart gini index tree
library('rpart')
library('rpart.plot')
tree <- rpart(myFormula, data = trainData,method = 'class')
rpart.plot(tree)
# Make predictions using the RPART model on the test data
test_pred <- predict(tree, newdata = testData, type = "class")
# Create a confusion matrix
conf_matrix <- table(test_pred, testData$Risk)
# Display the confusion matrix
print(conf_matrix)
##
## test_pred Low risk Borderline risk Intermediate risk High risk
## Low risk 50 28 9 3
## Borderline risk 19 41 16 0
## Intermediate risk 4 22 36 28
## High risk 2 1 12 45
# Calculate performance metrics
accuracy_D3 <- sum(diag(conf_matrix)) / sum(conf_matrix)
error_rate_D3 <- 1 - accuracy_D3
sensitivity_D3 <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
specificity_D3 <- sum(diag(conf_matrix[-2, -2])) / sum(conf_matrix[-2, ])
precision_D3 <- conf_matrix[2, 2] / sum(conf_matrix[, 2])
# Display performance metrics
cat("Accuracy: ", accuracy_D3, "\n")
## Accuracy: 0.5443038
cat("Error Rate: ", error_rate_D3, "\n")
## Error Rate: 0.4556962
cat("Sensitivity (Recall): ", sensitivity_D3, "\n")
## Sensitivity (Recall): 0.5394737
cat("Specificity: ", specificity_D3, "\n")
## Specificity: 0.5458333
cat("Precision: ", precision_D3, "\n")
## Precision: 0.4456522
Analysis:
The outcomes of the RPART model showcase a discernible performance across distinct risk categories. The model achieved an overall accuracy of 54.75%, highlighting its capability to make correct predictions across all classes. Specifically, it demonstrated a sensitivity of 51.02%, effectively identifying instances of low risk, and a specificity of 56.42%, accurately recognizing non-low-risk instances. The precision of 54.35% emphasizes the model’s accuracy in positive predictions.
Age as a Primary Split: The tree splits first on age, with a cutoff at 0.58, indicating the prominence of age as a determinant in risk classification.
Systolic Blood Pressure: For individuals above the age cutoff, systolic blood pressure is the next discriminator, particularly for the ‘High risk’ category (systolic < 0.42).
Diabetes Status: For those who are not in the ‘High risk’ category by blood pressure alone, diabetes status is used to further stratify the risk, especially for those within the intermediate age range (Age < 0.71).
# Create data frames for each summary
summary1 <- data.frame(
Model = "60% training 40% testing",
Accuracy = 57.39,
Sensitivity = 50.81,
Specificity = 60.14,
Precision = 53.41
)
summary2 <- data.frame(
Model = "70% training, 30% testing",
Accuracy = 60.31,
Sensitivity = 55.10,
Specificity = 62.78,
Precision = 64.29
)
summary3 <- data.frame(
Model = " 80% training 20% testing",
Accuracy = 54.75,
Sensitivity = 51.02,
Specificity = 56.42,
Precision = 54.35
)
# Combine summaries into a single data frame
comparison_table <- rbind(summary1, summary2, summary3)
# Print the comparison table
print(comparison_table)
## Model Accuracy Sensitivity Specificity Precision
## 1 60% training 40% testing 57.39 50.81 60.14 53.41
## 2 70% training, 30% testing 60.31 55.10 62.78 64.29
## 3 80% training 20% testing 54.75 51.02 56.42 54.35
Observations:
The model trained with 70% of the data for training and 30% for testing exhibits the highest overall performance with the highest accuracy, sensitivity, specificity, and precision.
The 60% training and 40% testing model follows closely with competitive metrics across all categories.
The 80% training and 20% testing model lags behind in accuracy and precision but maintains moderate performance in sensitivity and specificity.
Conclusion: Considering the three models, the 70% training and 30% testing model stands out as the most effective, striking a balance between accuracy, sensitivity, specificity, and precision. It outperforms the other two models, demonstrating its robustness in handling different proportions of training and testing data. the decision tree suggests a hierarchical model where age is the most significant predictor, followed by systolic blood pressure, diabetic status, and smoking status.
the C4.5 model using information Gain emerged as the preferred choice. The C4.5 model exhibited superior predictive performance with a higher accuracy of 82.8% in the (80% training, 20% testing) partitioning , sensitivity, specificity, and precision compared to the other models. The decision to favor C4.5 is grounded in its ability to capture both positive and negative instances effectively, making it well-suited for the dataset characteristics. The model’s strength lies in identifying clear cases (Low and High risk). Age is the primary split, indicating its importance as a predictor. For individuals older than 0.5641026 (normalized value), the risk generally increases.
Clustering models are utilized to group data into distinct clusters or groups. In our case, we will apply the k-means clustering algorithm to our dataset and interpret the results, taking into consideration our knowledge of the class label.
Certain factors can impact the efficacy of the final clusters formed when using k-means clustering that we have to be aware. For instance, outliers: Cluster formation is very sensitive to the presence of outliers as that they can pull the cluster towards itself, thus affecting optimal cluster formation. However, we have already addressed this concern in earlier steps.
cdataset = subset(dataset, select = -c(Risk))
We can now use the rest of the attributes for clustering.
The checking is because K-Means algorithm does not work with categorical data.
# 1- view
str(cdataset)
## 'data.frame': 1000 obs. of 9 variables:
## $ isMale : int 1 0 0 1 0 0 1 1 0 1 ...
## $ isBlack : int 1 0 1 1 0 0 0 0 0 0 ...
## $ isSmoker : int 0 0 1 1 1 1 1 1 1 0 ...
## $ isDiabetic : int 1 1 1 1 0 0 0 1 0 1 ...
## $ isHypertensive: int 1 1 1 0 1 1 0 0 1 1 ...
## $ Age : num 0.2308 0.7436 0.2564 0.0513 0.6667 ...
## $ Systolic : num 0.1 0.7 0.827 0.5 0.4 ...
## $ Cholesterol : num 0.729 0.357 0.243 0.514 0.986 ...
## $ HDL : num 0.15 0.487 0.487 0.325 0.537 ...
It’s clear that all 9 variables are numeric of type integer so we can start working on it with no issues.
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
cdataset <- scale(cdataset)
fviz_nbclust(cdataset, kmeans, method = "silhouette")+ labs(subtitle = "silhouette method")
According to silhouette method best number of clusters is K = 2 so will test it along with other high points such as k=4 , k=8.
# 2- prepreocessing
#Data types should be transformed into numeric types before clustering.
cdataset <- scale(cdataset)
K-means algorithm is non-deterministic, meaning that the clustering outcome can vary each time the algorithm is executed, even when applied to the same dataset. To address this, we will set a seed for the random number generation, ensuring that the results can be reproduced consistently.
# 3- run k-means clustering to find 2 clusters
#set a seed for random number generation to make the results reproducible
set.seed(8953)
kmeans.result <- kmeans(cdataset,2)
# print the clusterng result
kmeans.result
## K-means clustering with 2 clusters of sizes 516, 484
##
## Cluster means:
## isMale isBlack isSmoker isDiabetic isHypertensive Age
## 1 -0.02262886 -0.04843516 0.9680116 0.02577952 -0.001627174 -0.02976937
## 2 0.02412499 0.05163749 -1.0320124 -0.02748395 0.001734756 0.03173759
## Systolic Cholesterol HDL
## 1 0.04730009 -0.01946460 -0.007645875
## 2 -0.05042737 0.02075152 0.008151387
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2 2 1 1 1 1 1 1 1 2 1 2 2 1 2 1
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 2 1 1 2 1 2 1 1 2 2 2 2 1 2 2 2
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 2 1 1 2 1 2 1 2 1 1 2 2 2 2 1 1
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 2 1 1 1 2 1 1 2 1 2 2 2 1 1 1 2
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 1 2 2 1 1 2 1 1 1 1 2 2 2 2 1 1
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
## 1 1 2 2 2 2 2 2 1 2 1 1 2 1 1 1
## 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
## 2 2 1 1 2 1 2 1 1 2 2 2 2 1 1 1
## 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
## 1 1 2 2 2 2 1 1 1 2 1 2 2 1 2 2
## 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 1 1 2 2 1 1 2 2 1 1 2 1 2 2 1 1
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 2 2 1 2 2 1 1 1 1 2 1 1 2 2 2 1
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
## 2 1 1 1 1 1 2 2 1 1 2 1 2 1 1 1
## 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
## 1 2 2 2 2 2 2 2 1 2 2 2 1 1 2 2
## 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208
## 2 2 2 1 1 2 2 2 2 2 2 2 1 1 2 1
## 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
## 1 2 1 1 1 1 1 2 2 1 2 1 2 1 1 1
## 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 1 1 2 2 1 1 2 2 2 2 2 1 1 2 1 1
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256
## 1 2 1 2 2 2 2 1 2 2 2 2 1 1 2 1
## 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272
## 1 1 2 1 2 1 2 2 2 2 1 2 2 1 1 2
## 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## 2 2 2 1 1 1 1 1 1 2 1 2 2 1 1 2
## 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304
## 2 2 1 2 1 1 1 1 2 2 2 2 1 1 2 1
## 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 2 2 1 1 1 1 1 1 2 1 1 1 1 1 1 2
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336
## 1 2 1 2 2 1 1 1 2 2 1 2 2 1 2 2
## 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352
## 1 1 2 2 2 2 2 2 1 1 2 2 2 1 1 2
## 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368
## 2 2 1 1 2 2 2 2 2 2 2 2 1 1 2 1
## 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384
## 2 1 2 2 1 2 2 1 1 2 1 2 2 2 2 1
## 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
## 1 2 2 2 1 1 1 2 1 1 1 1 2 1 2 1
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416
## 1 2 1 1 2 2 1 1 1 1 2 2 2 2 1 1
## 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432
## 1 1 1 1 2 2 1 2 2 1 2 2 2 1 1 2
## 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448
## 1 2 1 2 2 1 1 1 1 2 1 2 1 1 2 2
## 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464
## 1 1 2 1 2 1 2 1 2 1 1 2 1 1 1 2
## 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480
## 1 1 1 1 1 1 1 1 1 1 2 2 2 1 2 1
## 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496
## 1 2 2 1 1 1 2 1 1 2 1 2 1 1 2 1
## 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512
## 1 1 2 2 2 2 2 2 2 2 2 2 2 1 1 2
## 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528
## 1 1 2 1 2 2 1 1 1 1 2 1 2 2 2 2
## 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544
## 2 2 1 2 1 2 2 1 2 1 1 1 1 1 2 1
## 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560
## 2 1 2 2 2 2 2 1 1 2 2 1 1 1 1 2
## 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576
## 1 1 2 1 2 1 1 2 2 1 1 2 2 2 2 2
## 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592
## 1 1 2 1 1 2 1 2 2 2 2 1 2 2 1 1
## 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608
## 2 2 2 1 1 2 2 1 1 1 1 1 2 1 1 1
## 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624
## 1 2 1 2 1 1 1 2 1 2 1 1 1 2 1 2
## 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640
## 2 1 1 2 2 2 1 2 2 1 2 1 2 2 2 1
## 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656
## 1 1 2 2 2 2 1 2 1 2 2 2 1 1 2 1
## 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672
## 1 1 2 1 2 1 2 1 1 1 1 1 1 1 1 2
## 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688
## 2 2 2 2 1 1 2 1 1 2 1 2 2 1 1 2
## 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704
## 1 1 2 1 1 2 1 1 2 1 2 1 2 1 2 2
## 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720
## 1 2 2 2 2 2 1 1 2 1 2 1 1 2 2 1
## 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736
## 1 1 2 2 1 1 1 1 1 1 2 2 2 1 1 1
## 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752
## 1 2 2 2 1 1 1 2 1 2 1 2 2 1 2 1
## 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768
## 2 2 1 1 1 1 1 1 1 1 2 2 1 1 2 1
## 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784
## 2 2 2 1 2 1 2 1 2 2 1 2 2 1 2 2
## 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800
## 2 1 2 1 1 1 1 2 1 2 2 2 1 1 2 1
## 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816
## 1 2 1 2 2 1 1 2 1 2 2 2 1 2 1 1
## 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832
## 2 1 1 2 1 1 2 1 1 2 2 2 1 1 2 2
## 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848
## 1 2 2 1 1 1 2 1 1 2 1 1 2 1 2 2
## 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864
## 1 2 1 2 2 2 2 1 2 1 2 1 2 2 1 1
## 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880
## 2 2 2 1 1 1 2 1 1 1 2 2 1 1 1 2
## 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896
## 1 2 2 1 2 2 2 1 1 2 1 1 2 2 1 1
## 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912
## 1 1 2 1 1 2 2 1 2 2 2 1 1 2 1 1
## 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928
## 1 2 2 1 1 1 2 1 2 1 1 1 2 1 1 2
## 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944
## 2 1 1 1 2 2 2 1 1 1 1 2 1 1 2 2
## 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960
## 2 2 2 2 1 2 2 2 2 2 1 1 1 1 1 1
## 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976
## 1 2 1 1 1 1 1 1 1 2 1 1 1 2 1 2
## 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992
## 1 2 1 2 1 2 2 2 1 2 2 2 1 2 1 1
## 993 994 995 996 997 998 999 1000
## 2 2 2 1 2 1 1 2
##
## Within cluster sum of squares by cluster:
## [1] 4105.816 3878.629
## (between_SS / total_SS = 11.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
k-means algorithm is used to identify and assign the data to two clusters so that each observation will be assigned to one of the two clusters. From the output, we can observe that two different clusters have been found with sizes 516 and 484, and the within cluster sum of square (WCSS) =11.2% meaning the clusters are kind of compacted. But we need to visualize it to have a better look.
Cluster Plot:
# 4- visualize clustering and install package
library(factoextra)
fviz_cluster(kmeans.result, data = cdataset)
The plot shows overlapping clusters, particularly in the middle, suggesting that some data points are challenging to assign to a specific cluster. An avegrage silhouette coefficient can be more precise so we will calculate it.
The value is between [-1, 1], a score of 1 denotes the best. And the worst value is -1. Values near 0 denote overlapping clusters.
#Average silhouette
library(cluster)
avg_sil <- silhouette(kmeans.result$cluster, dist(cdataset))
# k-means clustering with estimating k and initializations
fviz_silhouette(avg_sil)
## cluster size ave.sil.width
## 1 1 516 0.11
## 2 2 484 0.11
The Average Silhouette Coefficient of 0.11 suggests that there is a certain level of similarity among the data points within the clusters formed through the clustering process. However, the coefficient is relatively low, approaching zero, indicating the presence of overlapping clusters.
To measure the quality of the cluster the average BCubed precision and recall of all objects in the data set is computed:
# Cluster assignments and ground truth labels
cluster_assignments <- kmeans.result$cluster
ground_truth <- dataset$Risk
# Function to calculate BCubed precision and recall
calculate_bcubed_metrics <- function(cluster_assignments, ground_truth) {
n <- length(cluster_assignments)
precision_sum <- 0
recall_sum <- 0
for (i in 1:n) {
cluster <- cluster_assignments[i]
label <- ground_truth[i]
# Count the number of items from the same category within the same cluster
same_category_same_cluster <- sum(ground_truth[cluster_assignments == cluster] == label)
# Count the total number of items in the same cluster
total_same_cluster <- sum(cluster_assignments == cluster)
# Count the total number of items with the same category
total_same_category <- sum(ground_truth == label)
# Calculate precision and recall for the current item and add them to the sums
precision_sum <- precision_sum + same_category_same_cluster / total_same_cluster
recall_sum <- recall_sum + same_category_same_cluster / total_same_category
}
precision <- precision_sum / n # Calculate average precision
recall <- recall_sum / n # Calculate average recall
return(list(precision = precision, recall = recall)) }
# Calculate BCubed precision and recall
precision_recall <- calculate_bcubed_metrics(cluster_assignments, ground_truth)
# Extract precision and recall from the metrics
precision <- precision_recall$precision
recall <- precision_recall$recall
# Print the results
cat(" BCubed Precision:", precision, "\n","BCubed Recall:", recall)
## BCubed Precision: 0.3299589
## BCubed Recall: 0.5317886
The calculated precision value is 0.32996 not a high value. It means that the clusters are not pure; meaning not all data points in a cluster belong to the same category.
On the other hand, the calculated recall value of 0.53179 implies that approximately half of the objrcts belonging to the same categore are correctly assigned to the same cluster.
Conclusion of K=2:
Considering upove results for K=2 in isolation, without considering our knowledge of the class label, it is evident that the performance is suboptimal (less than ideal). Therefore, it is recommended to explore other values for K in order to achieve better clustering results.
# 2- prepreocessing
#Data types should be transformed into numeric types before clustering.
cdataset <- scale(cdataset)
# 1- run k-means clustering to find 4 clusters
#set a seed for random number generation to make the results reproducible
set.seed(8953)
kmeans_result <- kmeans(cdataset, centers = 4, nstart = 25)
#Accessing kmeans_result
print(kmeans_result)
## K-means clustering with 4 clusters of sizes 240, 255, 244, 261
##
## Cluster means:
## isMale isBlack isSmoker isDiabetic isHypertensive Age
## 1 -0.004998499 0.098461545 -1.0320124 -0.002334427 1.00954535 0.04092810
## 2 -0.101538140 -1.061382078 0.9680116 0.124685876 0.02175491 -0.01063463
## 3 0.052771040 0.005581038 -1.0320124 -0.052221191 -0.98955436 0.02269775
## 4 0.054466405 0.941225616 0.9680116 -0.070853124 -0.02447174 -0.04846424
## Systolic Cholesterol HDL
## 1 -0.03065348 -0.08081696 -0.004490818
## 2 0.08003760 0.02566201 -0.052055040
## 3 -0.06987709 0.12065493 0.020586343
## 4 0.01531517 -0.06355382 0.035742390
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1 1 4 4 2 2 2 2 2 1 2 1 1 4 1 2
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 1 4 4 1 2 3 2 4 1 1 1 1 2 3 1 1
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 1 4 4 1 2 1 2 3 4 2 3 3 3 3 4 2
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 3 2 4 4 3 4 2 3 4 3 3 1 4 2 2 1
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 4 3 1 4 4 3 2 4 2 4 3 1 3 3 4 4
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
## 4 4 1 1 1 3 3 3 4 1 4 4 3 2 4 2
## 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
## 1 1 2 2 1 2 1 4 4 3 1 3 3 2 2 2
## 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
## 4 2 1 3 3 1 4 2 4 3 4 1 3 4 3 3
## 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 2 4 3 1 4 4 1 3 2 2 1 2 3 1 2 4
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 1 3 2 3 3 2 2 4 4 3 4 2 3 1 1 2
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
## 1 4 2 4 4 2 1 1 2 4 1 2 3 2 2 2
## 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
## 4 1 1 3 3 1 3 3 4 1 1 1 2 2 1 3
## 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208
## 3 3 3 4 2 1 1 3 1 3 1 1 2 4 3 4
## 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
## 4 3 2 2 2 4 4 3 1 4 3 4 1 2 4 2
## 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 2 4 3 1 4 4 3 1 1 1 3 2 4 3 2 2
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256
## 2 3 4 3 3 3 3 4 3 1 1 3 4 2 1 2
## 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272
## 2 4 3 4 3 2 1 1 1 3 2 3 1 4 2 1
## 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## 1 3 1 4 2 4 4 4 4 1 2 3 3 2 2 3
## 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304
## 3 1 4 3 4 2 4 2 3 1 1 1 2 4 1 2
## 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 3 1 2 2 2 2 2 2 1 4 4 2 2 4 4 1
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336
## 2 1 2 1 1 2 2 4 1 3 4 3 3 4 3 1
## 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352
## 2 2 3 1 3 3 3 1 4 4 3 1 3 2 2 1
## 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368
## 1 1 2 2 3 1 1 1 1 1 1 3 4 4 3 4
## 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384
## 1 2 3 1 2 1 3 2 2 3 4 3 1 1 3 2
## 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
## 2 1 1 3 4 4 2 3 2 2 2 4 3 2 3 4
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416
## 2 3 4 2 3 1 2 2 4 4 3 3 1 3 2 4
## 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432
## 2 4 4 2 3 1 2 1 1 2 3 3 1 2 4 3
## 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448
## 4 1 2 1 1 4 2 2 2 3 2 3 2 4 1 3
## 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464
## 4 2 3 2 3 4 1 2 1 4 4 1 4 2 4 3
## 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480
## 2 2 4 2 2 2 4 4 4 4 3 3 3 2 1 2
## 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496
## 4 3 1 2 2 4 3 4 2 3 4 1 2 2 1 2
## 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512
## 2 2 1 3 3 3 1 3 3 3 3 3 1 4 2 3
## 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528
## 4 4 3 2 3 1 2 2 2 4 3 4 3 1 1 1
## 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544
## 1 3 4 3 4 1 3 2 1 4 4 4 2 4 3 4
## 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560
## 1 2 3 1 1 1 1 4 2 1 1 4 2 4 4 1
## 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576
## 4 2 1 4 1 4 4 1 3 2 4 3 1 3 1 3
## 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592
## 2 2 1 2 2 1 4 1 1 3 3 4 1 3 4 2
## 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608
## 3 1 1 4 4 3 1 4 2 4 2 4 3 2 2 2
## 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624
## 4 1 2 1 2 4 4 1 4 3 2 4 4 3 4 3
## 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640
## 3 4 2 1 1 1 2 1 1 2 3 2 3 3 3 2
## 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656
## 2 2 3 3 1 1 2 3 2 3 3 1 4 4 1 4
## 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672
## 2 4 3 4 3 4 3 4 4 4 2 4 4 4 4 1
## 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688
## 1 3 1 3 2 4 1 2 4 1 4 3 3 4 2 1
## 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704
## 2 4 1 2 4 1 4 4 1 2 3 4 3 4 3 3
## 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720
## 4 3 1 3 1 1 4 4 3 2 3 2 2 3 1 4
## 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736
## 4 4 1 1 2 2 2 4 4 2 1 3 1 2 4 4
## 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752
## 2 1 1 3 4 2 4 3 4 3 2 1 1 4 3 2
## 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768
## 3 1 4 2 2 4 2 4 4 2 1 3 2 2 3 4
## 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784
## 3 3 3 2 3 4 3 4 1 3 4 1 1 4 3 3
## 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800
## 3 2 3 4 2 2 4 1 4 3 3 3 2 2 3 4
## 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816
## 2 1 4 3 3 2 2 3 4 3 3 1 2 1 4 2
## 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832
## 1 4 2 3 4 4 1 4 4 3 1 1 2 2 3 1
## 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848
## 2 3 3 2 4 2 3 4 4 1 4 4 3 4 1 1
## 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864
## 2 1 4 1 3 1 3 2 3 4 3 2 3 3 2 4
## 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880
## 3 3 1 4 2 2 1 4 2 4 3 3 4 2 4 3
## 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896
## 4 1 1 2 3 1 1 2 4 3 4 4 1 1 4 2
## 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912
## 2 4 3 2 2 1 1 4 3 3 1 4 2 3 4 4
## 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928
## 2 3 1 2 4 2 3 2 1 2 2 2 1 4 2 3
## 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944
## 1 2 4 2 3 1 3 4 4 2 2 1 4 2 3 1
## 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960
## 1 3 1 1 4 1 1 1 1 1 4 2 4 4 4 2
## 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976
## 4 3 2 4 2 2 4 4 2 3 4 4 4 1 2 1
## 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992
## 4 1 4 3 2 3 1 3 2 3 1 3 4 1 4 2
## 993 994 995 996 997 998 999 1000
## 1 1 3 2 3 4 4 3
##
## Within cluster sum of squares by cluster:
## [1] 1648.259 1799.286 1739.876 1778.161
## (between_SS / total_SS = 22.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
We can observe that four different clusters have been found with sizes 240 , 255 ,244 and 261. And the within cluster sum of square (WCSS) =22.5% which means that the cluster less compact and cohesive. Its higher than 2 clusters result which means 2 clusters are better in terms of compactness.
Cluster plot :
# 2- visualize clustering and install package
library(factoextra)
fviz_cluster(kmeans_result, data = cdataset)
As we can see In the cluster plot, it’s evident that there are overlapping clusters.
#3-Average silhouette
library(cluster)
avg_sil <- silhouette(kmeans_result$cluster, dist(cdataset))
# k-means clustering with estimating k and initializations
fviz_silhouette(avg_sil)
## cluster size ave.sil.width
## 1 1 240 0.13
## 2 2 255 0.12
## 3 3 244 0.12
## 4 4 261 0.13
An Average Silhouette coefficient of 0.12 indicate that the clustering is not very well-defined, and there is ambiguity and overlap between clusters. However, the result is higher than 2 clusters.
# Cluster assignments and ground truth labels
cluster_assignments <- kmeans_result$cluster
ground_truth <- dataset$Risk
# Function to calculate BCubed precision and recall
calculate_bcubed_metrics <- function(cluster_assignments, ground_truth) {
n <- length(cluster_assignments)
precision_sum <- 0
recall_sum <- 0
for (i in 1:n) {
cluster <- cluster_assignments[i]
label <- ground_truth[i]
# Count the number of items from the same category within the same cluster
same_category_same_cluster <- sum(ground_truth[cluster_assignments == cluster] == label)
# Count the total number of items in the same cluster
total_same_cluster <- sum(cluster_assignments == cluster)
# Count the total number of items with the same category
total_same_category <- sum(ground_truth == label)
# Calculate precision and recall for the current item and add them to the sums
precision_sum <- precision_sum + same_category_same_cluster / total_same_cluster
recall_sum <- recall_sum + same_category_same_cluster / total_same_category
}
precision <- precision_sum / n # Calculate average precision
recall <- recall_sum / n # Calculate average recall
return(list(precision = precision, recall = recall)) }
# Calculate BCubed precision and recall
precision_recall <- calculate_bcubed_metrics(cluster_assignments, ground_truth)
# Extract precision and recall from the metrics
precision <- precision_recall$precision
recall <- precision_recall$recall
# Print the results
cat(" BCubed Precision:", precision, "\n","BCubed Recall:", recall)
## BCubed Precision: 0.336335
## BCubed Recall: 0.2729542
The calculated precision value is 0.336335 not a high value it mean the clusters are not pure.and not all data points in a cluster belong to the same category.
The calculated recall value is 0.2729542 it’s a low result meaning most of the data are not in the same cluster.
Conclusion of K=4:
After applying various evaluation metrics such as the average silhouette coefficient, within-cluster sum of squares ,Bcubed precision and recall.it became clear to us that k=4 Is not a good number of clusters since there is overlapping and the clusters are not pure .And the within cluster sum of square 4 clusters has a higher value than 2 cluster indicating that the 4 clusters less compact .but According to the number of class label its the best among the considered options.
# 2- prepreocessing
#Data types should be transformed into numeric types before clustering.
cdataset <- scale(cdataset)
# 3- run k-means clustering to find 8 clusters
#set a seed for random number generation to make the results reproducible
set.seed(8953)
kmeansresult <- kmeans(cdataset,8)
# print the clusterng result
kmeansresult
## K-means clustering with 8 clusters of sizes 136, 149, 100, 132, 122, 93, 139, 129
##
## Cluster means:
## isMale isBlack isSmoker isDiabetic isHypertensive Age
## 1 0.6374557 0.9412256 0.96801163 -0.11758451 0.4803719 -0.42674815
## 2 0.9928563 0.1348064 -1.03201240 -0.06416429 0.3789569 -0.26292001
## 3 0.8197539 -0.2403129 -0.09200111 -0.06403000 -0.9895544 0.72578390
## 4 -0.9645589 0.5467726 -0.78958524 0.24399312 0.4189023 0.34795779
## 5 -0.6683239 -0.4868635 0.60735156 -0.12602627 -0.9895544 -0.76525097
## 6 -0.4852307 0.3598234 0.86048345 0.31098443 -0.7316060 0.75221709
## 7 -0.3324182 -0.2401689 -1.03201240 -0.23835629 -0.1410156 -0.05978714
## 8 -0.1272486 -1.0613821 0.96801163 0.14986867 1.0095454 0.08076629
## Systolic Cholesterol HDL
## 1 0.06575189 -0.11568304 -0.01941434
## 2 -0.44527267 -0.11897944 -0.04664307
## 3 -0.13781479 0.97280405 0.09080812
## 4 -0.67231955 0.46294127 0.02679507
## 5 -0.42367631 -0.08839685 -0.13312255
## 6 0.50518690 -0.83938009 0.32754430
## 7 1.08076911 -0.37828447 -0.03913655
## 8 0.11170739 0.12791071 -0.09153707
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2 7 1 1 8 8 5 3 8 7 3 7 2 1 2 6
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 2 1 1 7 5 3 8 1 2 2 4 2 8 3 7 4
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 2 1 6 2 6 2 3 3 6 5 3 7 4 3 5 5
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 2 6 1 1 4 5 8 7 6 5 3 4 5 8 8 2
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 1 4 2 1 6 2 5 1 8 6 4 4 3 2 1 6
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
## 6 6 4 4 2 2 7 2 5 7 5 6 4 8 1 8
## 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
## 4 2 5 8 7 3 4 1 5 7 7 5 4 5 5 6
## 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
## 6 3 4 7 5 2 6 8 3 7 6 2 2 1 7 2
## 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 8 4 4 7 3 1 4 4 6 8 2 5 7 2 6 1
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 2 3 8 3 3 5 8 1 1 2 1 8 6 4 7 5
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
## 2 1 8 1 1 8 4 2 5 6 4 5 7 8 8 5
## 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
## 1 2 4 2 3 2 3 7 5 2 4 7 8 8 4 7
## 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208
## 2 3 7 3 8 2 7 2 7 2 4 2 8 1 4 1
## 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
## 6 7 3 8 8 1 6 2 7 1 7 1 4 6 1 8
## 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 8 3 7 2 1 3 7 4 7 2 7 8 6 7 8 8
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256
## 8 4 1 4 7 4 2 1 3 2 2 7 5 8 4 8
## 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272
## 5 1 4 3 5 6 4 4 7 5 8 3 7 1 8 7
## 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## 4 7 2 5 8 6 1 5 6 2 8 3 5 5 5 4
## 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304
## 2 2 6 7 1 8 1 8 7 4 4 7 6 1 2 5
## 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 7 7 8 5 8 3 5 8 7 6 1 3 8 1 1 4
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336
## 5 2 6 7 2 8 5 4 2 3 6 7 3 5 7 2
## 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352
## 8 8 4 4 2 2 7 4 1 1 7 4 3 8 3 4
## 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368
## 2 7 5 8 4 7 4 4 2 2 2 3 1 1 2 4
## 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384
## 4 3 7 4 8 7 2 5 8 7 1 7 2 2 3 6
## 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
## 6 2 4 3 1 1 8 4 8 5 8 1 7 8 2 4
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416
## 8 7 1 8 7 2 3 8 1 6 7 3 7 5 8 1
## 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432
## 5 1 4 8 2 4 8 4 2 8 7 4 4 5 3 7
## 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448
## 1 2 8 4 7 5 8 5 8 2 8 3 6 6 4 3
## 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464
## 3 8 2 8 7 1 4 3 4 1 1 2 6 6 1 2
## 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480
## 8 8 6 8 6 5 3 6 1 6 7 5 7 5 2 3
## 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496
## 5 4 7 8 5 1 6 6 8 2 1 2 8 5 2 8
## 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512
## 8 5 2 4 4 2 2 5 5 2 4 5 7 1 8 5
## 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528
## 6 6 6 8 3 7 8 6 6 4 5 4 2 7 4 2
## 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544
## 7 3 1 5 1 7 2 3 4 6 1 1 3 3 7 1
## 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560
## 2 3 2 7 2 4 4 6 3 7 7 1 8 5 4 7
## 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576
## 6 5 4 1 2 1 6 2 6 3 6 7 4 4 2 3
## 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592
## 3 8 7 8 8 4 6 4 2 2 2 4 4 7 6 8
## 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608
## 4 7 2 4 1 7 4 5 5 3 8 3 3 8 6 8
## 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624
## 1 2 5 4 3 1 6 2 1 3 8 6 1 3 4 7
## 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640
## 2 5 6 2 4 7 8 2 7 5 4 6 3 7 3 5
## 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656
## 8 5 7 7 7 2 5 7 5 2 5 2 6 1 7 1
## 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672
## 8 1 3 4 3 5 7 1 1 1 5 1 3 1 1 4
## 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688
## 4 3 4 3 8 5 2 5 1 4 1 7 2 5 6 2
## 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704
## 3 6 7 6 5 2 6 1 4 8 7 6 7 1 3 7
## 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720
## 6 7 4 4 2 7 1 1 7 5 7 6 8 5 7 6
## 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736
## 5 1 4 7 8 8 8 1 1 6 2 2 4 5 1 5
## 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752
## 8 4 2 4 6 5 1 2 1 2 5 4 2 5 3 8
## 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768
## 4 2 5 3 8 6 3 6 6 5 2 5 5 8 3 1
## 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784
## 3 7 3 8 7 6 2 1 4 7 5 2 4 1 3 2
## 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800
## 3 5 4 5 8 8 3 2 5 4 4 5 5 3 2 1
## 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816
## 3 2 1 5 7 8 8 2 5 7 2 4 6 2 6 3
## 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832
## 7 1 3 3 1 4 4 1 1 4 4 7 5 5 5 4
## 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848
## 5 4 7 8 1 5 5 5 5 2 5 1 3 4 7 2
## 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864
## 8 2 6 2 4 2 7 3 7 1 5 8 7 7 3 4
## 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880
## 3 3 4 6 8 5 2 6 5 1 2 7 6 8 1 7
## 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896
## 1 7 4 8 7 7 2 5 1 7 1 1 2 7 1 5
## 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912
## 6 4 7 5 8 2 4 1 2 7 2 1 8 3 1 1
## 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928
## 8 3 2 8 1 3 7 8 7 8 5 5 4 1 8 3
## 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944
## 2 3 5 8 4 2 4 6 3 8 6 4 1 5 7 2
## 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960
## 7 7 4 7 1 4 4 7 2 4 1 5 1 1 1 8
## 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976
## 6 2 3 1 5 8 1 1 5 3 6 1 1 2 5 7
## 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992
## 6 7 1 3 8 7 2 4 8 7 2 2 6 2 5 8
## 993 994 995 996 997 998 999 1000
## 2 2 7 8 6 1 6 7
##
## Within cluster sum of squares by cluster:
## [1] 797.7227 949.7918 641.1214 793.4957 737.4096 520.4053 918.1900 761.7654
## (between_SS / total_SS = 31.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
We can observe that the eight different clusters have been found with sizes 136, 149, 100, 132,93, 139 and 129 respectively, and the within cluster sum of square (WCSS) = 31.9%. which is higher than 2 and 4 clusters result which means 2,4 clusters are better in terms of compactness or homogeneity compared to the clustering result of 8 clusters.
Cluster Plot:
# 2- visualize clustering and install package
library(factoextra)
fviz_cluster(kmeansresult, data = cdataset)
It’s clear that the eight clusters are overlapping.
#Average silhouette
library(cluster)
avg_sil <- silhouette(kmeansresult$cluster, dist(cdataset))
# k-means clustering with estimating k and initializations
fviz_silhouette(avg_sil)
## cluster size ave.sil.width
## 1 1 136 0.12
## 2 2 149 0.09
## 3 3 100 0.08
## 4 4 132 0.12
## 5 5 122 0.10
## 6 6 93 0.13
## 7 7 139 0.06
## 8 8 129 0.12
An Average Silhouette Coefficient of 0.1 indicates that, the clusters formed in the clustering process have some degree of similarity among their data points. However, the result is lower than 2 clusters which has silhouette coefficient average of 0.11 and also lower than K=4 clusters that is equal to 0.12.
# Cluster assignments and ground truth labels
cluster_assignments <- kmeansresult$cluster
ground_truth <- dataset$Risk
# Function to calculate BCubed precision and recall
calculate_bcubed_metrics <- function(cluster_assignments, ground_truth) {
n <- length(cluster_assignments)
precision_sum <- 0
recall_sum <- 0
for (i in 1:n) {
cluster <- cluster_assignments[i]
label <- ground_truth[i]
# Count the number of items from the same category within the same cluster
same_category_same_cluster <- sum(ground_truth[cluster_assignments == cluster] == label)
# Count the total number of items in the same cluster
total_same_cluster <- sum(cluster_assignments == cluster)
# Count the total number of items with the same category
total_same_category <- sum(ground_truth == label)
# Calculate precision and recall for the current item and add them to the sums
precision_sum <- precision_sum + same_category_same_cluster / total_same_cluster
recall_sum <- recall_sum + same_category_same_cluster / total_same_category
}
precision <- precision_sum / n # Calculate average precision
recall <- recall_sum / n # Calculate average recall
return(list(precision = precision, recall = recall)) }
# Calculate BCubed precision and recall
precision_recall <- calculate_bcubed_metrics(cluster_assignments, ground_truth)
# Extract precision and recall from the metrics
precision <- precision_recall$precision
recall <- precision_recall$recall
# Print the results
cat(" BCubed Precision:", precision, "\n","BCubed Recall:", recall)
## BCubed Precision: 0.3747497
## BCubed Recall: 0.1554135
The calculated precision value is 0.37478 not a high value it mean the clusters are not pure.
The calculated recall value is 0.15541 it’s a low result meaning most of the data are not in the same cluster.
Conclusion of K=8:
Is not a good number of clusters especially when compared to the results obtained with K=2 and K=4 clusters. This conclusion is based on various evaluation metrics such as the average silhouette coefficient, within-cluster sum of squares, and Bcubed precision and recall. In all aspects, K=8 performed the worst. Additionally, considering the presence of class labels and our prior knowledge of the data set, we know the actual number of groups within the class label. So, by also taking this information into account, we can determine that K=8 is not an optimal number of clusters.
library(NbClust)
#a)fviz_nbclust() with silhouette method using library(factoextra)
fviz_nbclust(cdataset, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
#b) NbClust validation
fres.nbclust <- NbClust(cdataset, distance="euclidean", min.nc = 2, max.nc = 10, method="kmeans", index="all")
## Warning in log(det(P)/det(W)): NaNs produced
## Warning in log(det(P)/det(W)): NaNs produced
## Warning in log(det(P)/det(W)): NaNs produced
## Warning in log(det(P)/det(W)): NaNs produced
## Warning in log(det(P)/det(W)): NaNs produced
## Warning in log(det(P)/det(W)): NaNs produced
## Warning in log(det(P)/det(W)): NaNs produced
## Warning: did not converge in 10 iterations
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 6 proposed 2 as the best number of clusters
## * 3 proposed 3 as the best number of clusters
## * 8 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 2 proposed 9 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 4
##
##
## *******************************************************************
According to the NbClust validation method, which utilizes the majority rule, the best number of clusters is 4. This number contradicts the initial suggestion from the silhouette method, which indicated that the best number of clusters is 2. However, upon revisiting the calculations and evaluating the performance, it is almost accurate to conclude that K=4 indeed performs the best among the considered options.
| 80 %t raining set 20% testing set: | 70% raining set 30% testing set: | 60% raining set 40% testing set: |
| IG | IG ratio | Gini Index | IG | IG ratio | Gini Index | IG | IG ratio | Gini Index | |
|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 82.27% |
81.01% | 54.75% | 78.07% | 79.39% | 60.31% | 78.37% | 78.38% | 57.39% |
| Sensitivity | 80.28% |
78.26% | 51.02% | 77.23% | 78.00% | 55.10% | 80.60% | 73.72% | 50.81% |
| Specificity | 82.85% |
81.78% | 56.42% | 78.31% | 79.78% | 62.78% | 77.60% | 79.92% | 60.14% |
| Precision | 75.00% |
71.05% | 54.35% | 72.90% | 72.90% | 64.29% | 72.26% | 74.19% | 53.41% |
The Information Gain (IG) model, when trained with an 80% training set, stands out due to its exceptional accuracy (82.27%) and sensitivity (80.28%). This suggests that the model is adept at correctly identifying patients with a risk of ASCVD, a critical factor in preventative health measures. The high specificity (82.85%) further underscores the model’s ability to discern true negatives, minimizing false alarms and unnecessary treatments.
In each partition based on the metrics results, the decision tree
algorithm demonstrated varying degrees of performance. The 80-20 split
emerged as the most favorable for all metrics, showcasing the highest
accuracy, balanced sensitivity and specificity, and commendable
precision. This suggests that the model trained on 80% of the data and
tested on the remaining 20% achieved the most reliable predictions for
the 10-year ASCVD risk.
Comparing the three splits, the 80-20 configuration consistently outperformed the others, making it the preferred choice among the algorithms. It demonstrated superior accuracy, successfully navigating the intricacies of the dataset and maintaining a balance between correctly identifying positive and negative instances. While the 70-30 and 60-40 splits exhibited respectable performances, they fell short of the comprehensive reliability achieved by the 80-20 split.
In summary, the 80-20 split with the decision tree algorithm emerged as the optimal configuration, providing the most accurate and balanced predictions for the 10-year ASCVD risk. This analysis underscores the importance of careful consideration in choosing the training and testing split, with the 80-20 partition demonstrating its efficacy as the best-performing algorithm among all evaluated configurations.
We tested 3 different number of clusters : K=2, K=4, K=8
| K= 2 | K= 4 | K= 8 | |
|---|---|---|---|
| Average Silhouette width | 0.11 | 0.12 | 0.1 |
| Total within-cluster sum of square | 11.2% | 22.5% | 31.9% |
| BCubed precision | 0.3299589 | 0.336335 | 0.3747497 |
| BCubed recall | 0.5317886 | 0.2729542 | 0.1554135 |
| Visualization |
In an overall comparison of the clustering results based on the metrics provided (Average Silhouette Width, Total Within-Cluster Sum of Square, BCubed Precision, and BCubed Recall) for different numbers of clusters (K=2, K=4, K=8)
we can say that K=4 has the highest average silhouette width, indicating well-defined and distinct clusters. The total within-cluster sum of square is higher, suggesting that the clusters are less compact. The precision is slightly better than K=2, but the recall is lower .based on the result of average silhouette width It suggests that partitioning the data into four clusters is favorable.
K=2 The average silhouette width is moderate, indicating some separation between clusters. it has the lowest total within-cluster sum of square percentage, suggests that the two clusters are relatively compact. The precision is decent, suggesting that the instances within each cluster are somewhat similar. However, the high recall indicates that there might be some instances that are not well captured by the clusters.
K=8 has the lowest average silhouette width, indicating less separation between clusters. The total within-cluster sum of square is the highest, suggesting that the clusters are less compact. The precision is the highest, but the recall is the lowest among the three. This may imply that while the instances within each cluster are similar, many relevant instances are missed.
K=4 a good choice as it has a higher average silhouette width compared to K=2.
K=8, based on these metrics, appears to have less favorable results compared to K=2 and K=4.
Comparison: Classification vs. Clustering
In this study, classification algorithms consistently outperform clustering algorithms in accurately predicting outcomes based on the provided features. While clustering may reveal inherent patterns and groupings within the data, it might not be as effective in predicting specific classes as witnessed in classification. Therefore, for this dataset and problem, classification appears to be the more suitable approach.
In the beginning, we selected a dataset that represents a 1000 generated samples with different kinds of health condition to predict the probability of having a 10-year ASCVD risk.
To ensure the highest level of efficiency and the most accurate results, we implemented a series of preprocessing steps. By using clear visual representations such as boxplots and histograms, we were able to get a clear picture of our data’s characteristics. This allowed us to effectively identify and remove any irregularities, such as missing information or statistical outliers, which could potentially distort our results. We then applied normalization and data balancing techniques, which adjusted the scales of our data features to a uniform range, and discretized the continuous ‘Risk’ variable into distinct categories, thereby simplifying the interpretation of risk levels for our classification tasks.
With our data prepared, we embarked on the core tasks of classification and clustering. Our tool of choice for the former was the decision tree model, tested across 3 different splits of training and testing data to get the best model’s accuracy. Our techniques yielded the following results:
for the classification, the 80-20 split with the decision tree algorithm provided the most accurate in all models especially in the The Information Gain (IG) model stands out due to its exceptional accuracy (82.27%), Sensitivity, Specificity and Precision. the key findings for the tree are :
Age as a Significant Predictor: Across all trees, age consistently appears as a significant factor and serving as the root node. This underlines the model’s reliance on age as a primary risk indicator,
Systolic Blood Pressure and HDL Cholesterol: These two health metrics are frequently used as secondary splits following age, indicating their importance in cardiovascular risk assessment. Higher systolic blood pressure is generally associated with higher risk, while higher HDL cholesterol levels often indicate lower risk.
Diabetes and smoking status further refine risk predictions, with diabetic individuals generally at a higher risk.
The presence of hypertension, especially in combination with other risk factors like high cholesterol, elevates the risk level.
As for clustering, we utilized the K-means clustering algorithm with different values of K to determine the optimal number of clusters. and evaluated the performance of each K value by analyzing various metrics, including the average silhouette width. Here are the key findings:
Among the tested K values, K=4 yielded the most favorable result.
The average silhouette width for K=4 was calculated to be 0.12, indicating better separation between clusters compared to other K values.
Following the majority rule, the optimal number of clusters for the dataset was determined to be 4.
Analyzing the scree plot, a notable observation is that the total within-cluster sum of squares (WCSS) decreases as the number of clusters increases. The selection of the optimal number of clusters is determined by identifying an “elbow” point.
# Decide how many clusters to look at
n_clusters <- 10
# Initialize total within sum of squares error: wss
wss <- numeric(n_clusters)
set.seed(123)
# Look over 1 to n possible clusters
for (i in 1:10) {
# Fit the model: km.out
km.out <- kmeans(cdataset, centers = i, nstart = 20)
# Save the within cluster sum of squares
wss[i] <- km.out$tot.withinss
}
# Produce a scree plot
wss_df <- tibble(clusters = 1:10, wss = wss)
scree_plot <- ggplot(wss_df, aes(x = clusters, y = wss, group = 1)) +
geom_point(size = 4)+
geom_line() +
scale_x_continuous(breaks = c(2, 4, 6, 8, 10)) +
xlab('Number of clusters')
scree_plot
scree_plot +
geom_hline(
yintercept = wss,
linetype = 'dashed',
col = c(rep('#000000',3),'#FF0000', rep('#000000', 6))
)
The identified elbow point corresponds to K=4, indicating that the WCSS decreases at a slower rate beyond this number of clusters. Thus, K=4 is considered the suitable number of clusters based on this criterion.
All these mentioned findings highlight the effectiveness of utilizing the K-means algorithm with K=4 in achieving the highest level of separation among the clusters under consideration.
In summary, both the supervised learning model (Classification) and the unsupervised learning model (Clustering) played crucial roles in predicting the 10-year ASCVD risk in adults using key features, contributing to the successful accomplishment of our goal.
The supervised learning model (Classification), benefiting from the inclusion of the class label “Risk” in the dataset, proved to be more accurate, precise, and suitable for the task.
On the other hand, the unsupervised learning model (Clustering) encountered challenges in achieving pure clusters due to the absence of labeled data. Despite this limitation, it still provided valuable insights into the underlying patterns and structures within the dataset.
[1] “Data Preprocessing in R,” Engineering Education (EngEd) Program | Section. https://www.section.io/engineering-education/data-preprocessing-in-r/
[2] “K-Means Clustering in R with Step by Step Code Examples,” www.datacamp.com. https://www.datacamp.com/tutorial/k-means-clustering-r
[3] M. Sarah, “A Comprehensive Guide to Cluster Analysis: Applications, Best Practices and Resources,” Displayr, Jun. 06, 2023. https://www.displayr.com/understanding-cluster-analysis-a-comprehensive-guide/
[4] “RPubs - Data Mining: Classification with Decision Trees,” rpubs.com. https://rpubs.com/kjmazidi/195428
[5 ] “RPubs - Classification and Regression Trees (CART) in R,” rpubs.com. https://rpubs.com/camguild/803096